Project Overview¶

Objective: Prepare a prototype of a model to predict the amount of gold recovered from gold ore to help optimize the production and eliminate unprofitable parameters.

  • Data provided is on extraction & purification

Steps to Perform:

  • Prepare data
  • Perform Analysis
  • Develop/Train Model

Technical Understanding¶

How Gold is Extracted from Ore:

  • Mine the ore
  • Primary Pocessing to get finely ground gold/ore mixture suspended in water
  • Floatation
    • Rougher feed fed into flotation banks to obtain rougher Au concentrate and rougher tailings
    • The stability of this process is affected by the volatile and non-optimal physicochemical state of the flotation pulp
  • Two-Stage Purification (Cleaning rougher concentrate)
    • Stage 1 of Cleaning
    • Stage 2 of Cleaning
  • Final Concentrate & New Tails

image.png

Terminology Simplified:

  • Raw Material = Mined Ore
  • Primary Processing = crushing and grinding
  • Finely Ground Gold Ore Mixture Suspended in Water = Pulp/Slurry
    • the Pulp/Slurry is made from the grounded ore & mixed with water and, often reagents; BEFORE fed into flotation bank
  • Ore Pulp/Slurry = Rougher Feed
  • Rougher Additions (Reagent Additions): flotation reagents added to the pulp/slurry (rougher feed) during conditioning or just before entering the rougher flotation bank. - Xanthate – collector (promoter/activator for sulfide mineral flotation) - Sodium Sulphide (Na₂S) – used here as a sulphidizing agent to improve flotation of certain minerals - Sodium Silicate – depressant (suppresses gangue minerals like silicates)
  • Floataion = Rougher Process
  • Rougher Au Concentrate = Product with Higher Concentration of Gold Particles Separated from Waste Minerals in the Ore
  • Rougher Tailings = Product Residue with Low Concentration of Valuable Metal
  • Flotation Pulp = Mixture of Solid Particles and Liquid
    • Solid Particles = the finely ground ore (containing both valuable minerals like gold and waste minerals)
    • Liquid = water plus any chemical reagents (collectors, frothers, modifiers)
  • Two-Stage Purification = Cleaning Rougher Concentrate
  • Final Concentrate = concentrated ore containing valuable metals (not yet refined gold/metal)
  • New Tails = waste material after extraction

Additional Terminology:

  • air amount = volume of air
  • feed size = feed particle size

Environment Setup & Required Libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

Download & Prepare the Data¶

In [2]:
# Download the Data

gold_train = pd.read_csv('/datasets/gold_recovery_train.csv')
gold_test = pd.read_csv('/datasets/gold_recovery_test.csv')
gold_full = pd.read_csv('/datasets/gold_recovery_full.csv')
In [3]:
# View the data

display(gold_train)
display(gold_test)
display(gold_full)
date final.output.concentrate_ag final.output.concentrate_pb final.output.concentrate_sol final.output.concentrate_au final.output.recovery final.output.tail_ag final.output.tail_pb final.output.tail_sol final.output.tail_au ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 6.055403 9.889648 5.507324 42.192020 70.541216 10.411962 0.895447 16.904297 2.143149 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 6.029369 9.968944 5.257781 42.701629 69.266198 10.462676 0.927452 16.634514 2.224930 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 6.055926 10.213995 5.383759 42.657501 68.116445 10.507046 0.953716 16.208849 2.257889 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 6.047977 9.977019 4.858634 42.689819 68.347543 10.422762 0.883763 16.532835 2.146849 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 6.148599 10.142511 4.939416 42.774141 66.927016 10.360302 0.792826 16.525686 2.055292 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16855 2018-08-18 06:59:59 3.224920 11.356233 6.803482 46.713954 73.755150 8.769645 3.141541 10.403181 1.529220 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
16856 2018-08-18 07:59:59 3.195978 11.349355 6.862249 46.866780 69.049291 8.897321 3.130493 10.549470 1.612542 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
16857 2018-08-18 08:59:59 3.109998 11.434366 6.886013 46.795691 67.002189 8.529606 2.911418 11.115147 1.596616 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
16858 2018-08-18 09:59:59 3.367241 11.625587 6.799433 46.408188 65.523246 8.777171 2.819214 10.463847 1.602879 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
16859 2018-08-18 10:59:59 3.598375 11.737832 6.717509 46.299438 70.281454 8.406690 2.517518 10.652193 1.389434 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

16860 rows × 87 columns

date primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level primary_cleaner.state.floatbank8_c_air ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-09-01 00:59:59 210.800909 14.993118 8.080000 1.005021 1398.981301 -500.225577 1399.144926 -499.919735 1400.102998 ... 12.023554 -497.795834 8.016656 -501.289139 7.946562 -432.317850 4.872511 -500.037437 26.705889 -499.709414
1 2016-09-01 01:59:59 215.392455 14.987471 8.080000 0.990469 1398.777912 -500.057435 1398.055362 -499.778182 1396.151033 ... 12.058140 -498.695773 8.130979 -499.634209 7.958270 -525.839648 4.878850 -500.162375 25.019940 -499.819438
2 2016-09-01 02:59:59 215.259946 12.884934 7.786667 0.996043 1398.493666 -500.868360 1398.860436 -499.764529 1398.075709 ... 11.962366 -498.767484 8.096893 -500.827423 8.071056 -500.801673 4.905125 -499.828510 24.994862 -500.622559
3 2016-09-01 03:59:59 215.336236 12.006805 7.640000 0.863514 1399.618111 -498.863574 1397.440120 -499.211024 1400.129303 ... 12.033091 -498.350935 8.074946 -499.474407 7.897085 -500.868509 4.931400 -499.963623 24.948919 -498.709987
4 2016-09-01 04:59:59 199.099327 10.682530 7.530000 0.805575 1401.268123 -500.808305 1398.128818 -499.504543 1402.172226 ... 12.025367 -500.786497 8.054678 -500.397500 8.107890 -509.526725 4.957674 -500.360026 25.003331 -500.856333
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5851 2017-12-31 19:59:59 173.957757 15.963399 8.070000 0.896701 1401.930554 -499.728848 1401.441445 -499.193423 1399.810313 ... 13.995957 -500.157454 12.069155 -499.673279 7.977259 -499.516126 5.933319 -499.965973 8.987171 -499.755909
5852 2017-12-31 20:59:59 172.910270 16.002605 8.070000 0.896519 1447.075722 -494.716823 1448.851892 -465.963026 1443.890424 ... 16.749781 -496.031539 13.365371 -499.122723 9.288553 -496.892967 7.372897 -499.942956 8.986832 -499.903761
5853 2017-12-31 21:59:59 171.135718 15.993669 8.070000 1.165996 1498.836182 -501.770403 1499.572353 -495.516347 1502.749213 ... 19.994130 -499.791312 15.101425 -499.936252 10.989181 -498.347898 9.020944 -500.040448 8.982038 -497.789882
5854 2017-12-31 22:59:59 179.697158 15.438979 8.070000 1.501068 1498.466243 -500.483984 1497.986986 -519.200340 1496.569047 ... 19.958760 -499.958750 15.026853 -499.723143 11.011607 -499.985046 9.009783 -499.937902 9.012660 -500.154284
5855 2017-12-31 23:59:59 181.556856 14.995850 8.070000 1.623454 1498.096303 -499.796922 1501.743791 -505.146931 1499.535978 ... 20.034715 -500.728588 14.914199 -499.948518 10.986607 -500.658027 8.989497 -500.337588 8.988632 -500.764937

5856 rows × 53 columns

date final.output.concentrate_ag final.output.concentrate_pb final.output.concentrate_sol final.output.concentrate_au final.output.recovery final.output.tail_ag final.output.tail_pb final.output.tail_sol final.output.tail_au ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 6.055403 9.889648 5.507324 42.192020 70.541216 10.411962 0.895447 16.904297 2.143149 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 6.029369 9.968944 5.257781 42.701629 69.266198 10.462676 0.927452 16.634514 2.224930 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 6.055926 10.213995 5.383759 42.657501 68.116445 10.507046 0.953716 16.208849 2.257889 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 6.047977 9.977019 4.858634 42.689819 68.347543 10.422762 0.883763 16.532835 2.146849 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 6.148599 10.142511 4.939416 42.774141 66.927016 10.360302 0.792826 16.525686 2.055292 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22711 2018-08-18 06:59:59 3.224920 11.356233 6.803482 46.713954 73.755150 8.769645 3.141541 10.403181 1.529220 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
22712 2018-08-18 07:59:59 3.195978 11.349355 6.862249 46.866780 69.049291 8.897321 3.130493 10.549470 1.612542 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
22713 2018-08-18 08:59:59 3.109998 11.434366 6.886013 46.795691 67.002189 8.529606 2.911418 11.115147 1.596616 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
22714 2018-08-18 09:59:59 3.367241 11.625587 6.799433 46.408188 65.523246 8.777171 2.819214 10.463847 1.602879 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
22715 2018-08-18 10:59:59 3.598375 11.737832 6.717509 46.299438 70.281454 8.406690 2.517518 10.652193 1.389434 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

22716 rows × 87 columns

In [4]:
# Understand the data
gold_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 87 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   date                                                16860 non-null  object 
 1   final.output.concentrate_ag                         16788 non-null  float64
 2   final.output.concentrate_pb                         16788 non-null  float64
 3   final.output.concentrate_sol                        16490 non-null  float64
 4   final.output.concentrate_au                         16789 non-null  float64
 5   final.output.recovery                               15339 non-null  float64
 6   final.output.tail_ag                                16794 non-null  float64
 7   final.output.tail_pb                                16677 non-null  float64
 8   final.output.tail_sol                               16715 non-null  float64
 9   final.output.tail_au                                16794 non-null  float64
 10  primary_cleaner.input.sulfate                       15553 non-null  float64
 11  primary_cleaner.input.depressant                    15598 non-null  float64
 12  primary_cleaner.input.feed_size                     16860 non-null  float64
 13  primary_cleaner.input.xanthate                      15875 non-null  float64
 14  primary_cleaner.output.concentrate_ag               16778 non-null  float64
 15  primary_cleaner.output.concentrate_pb               16502 non-null  float64
 16  primary_cleaner.output.concentrate_sol              16224 non-null  float64
 17  primary_cleaner.output.concentrate_au               16778 non-null  float64
 18  primary_cleaner.output.tail_ag                      16777 non-null  float64
 19  primary_cleaner.output.tail_pb                      16761 non-null  float64
 20  primary_cleaner.output.tail_sol                     16579 non-null  float64
 21  primary_cleaner.output.tail_au                      16777 non-null  float64
 22  primary_cleaner.state.floatbank8_a_air              16820 non-null  float64
 23  primary_cleaner.state.floatbank8_a_level            16827 non-null  float64
 24  primary_cleaner.state.floatbank8_b_air              16820 non-null  float64
 25  primary_cleaner.state.floatbank8_b_level            16833 non-null  float64
 26  primary_cleaner.state.floatbank8_c_air              16822 non-null  float64
 27  primary_cleaner.state.floatbank8_c_level            16833 non-null  float64
 28  primary_cleaner.state.floatbank8_d_air              16821 non-null  float64
 29  primary_cleaner.state.floatbank8_d_level            16833 non-null  float64
 30  rougher.calculation.sulfate_to_au_concentrate       16833 non-null  float64
 31  rougher.calculation.floatbank10_sulfate_to_au_feed  16833 non-null  float64
 32  rougher.calculation.floatbank11_sulfate_to_au_feed  16833 non-null  float64
 33  rougher.calculation.au_pb_ratio                     15618 non-null  float64
 34  rougher.input.feed_ag                               16778 non-null  float64
 35  rougher.input.feed_pb                               16632 non-null  float64
 36  rougher.input.feed_rate                             16347 non-null  float64
 37  rougher.input.feed_size                             16443 non-null  float64
 38  rougher.input.feed_sol                              16568 non-null  float64
 39  rougher.input.feed_au                               16777 non-null  float64
 40  rougher.input.floatbank10_sulfate                   15816 non-null  float64
 41  rougher.input.floatbank10_xanthate                  16514 non-null  float64
 42  rougher.input.floatbank11_sulfate                   16237 non-null  float64
 43  rougher.input.floatbank11_xanthate                  14956 non-null  float64
 44  rougher.output.concentrate_ag                       16778 non-null  float64
 45  rougher.output.concentrate_pb                       16778 non-null  float64
 46  rougher.output.concentrate_sol                      16698 non-null  float64
 47  rougher.output.concentrate_au                       16778 non-null  float64
 48  rougher.output.recovery                             14287 non-null  float64
 49  rougher.output.tail_ag                              14610 non-null  float64
 50  rougher.output.tail_pb                              16778 non-null  float64
 51  rougher.output.tail_sol                             14611 non-null  float64
 52  rougher.output.tail_au                              14611 non-null  float64
 53  rougher.state.floatbank10_a_air                     16807 non-null  float64
 54  rougher.state.floatbank10_a_level                   16807 non-null  float64
 55  rougher.state.floatbank10_b_air                     16807 non-null  float64
 56  rougher.state.floatbank10_b_level                   16807 non-null  float64
 57  rougher.state.floatbank10_c_air                     16807 non-null  float64
 58  rougher.state.floatbank10_c_level                   16814 non-null  float64
 59  rougher.state.floatbank10_d_air                     16802 non-null  float64
 60  rougher.state.floatbank10_d_level                   16809 non-null  float64
 61  rougher.state.floatbank10_e_air                     16257 non-null  float64
 62  rougher.state.floatbank10_e_level                   16809 non-null  float64
 63  rougher.state.floatbank10_f_air                     16802 non-null  float64
 64  rougher.state.floatbank10_f_level                   16802 non-null  float64
 65  secondary_cleaner.output.tail_ag                    16776 non-null  float64
 66  secondary_cleaner.output.tail_pb                    16764 non-null  float64
 67  secondary_cleaner.output.tail_sol                   14874 non-null  float64
 68  secondary_cleaner.output.tail_au                    16778 non-null  float64
 69  secondary_cleaner.state.floatbank2_a_air            16497 non-null  float64
 70  secondary_cleaner.state.floatbank2_a_level          16751 non-null  float64
 71  secondary_cleaner.state.floatbank2_b_air            16705 non-null  float64
 72  secondary_cleaner.state.floatbank2_b_level          16748 non-null  float64
 73  secondary_cleaner.state.floatbank3_a_air            16763 non-null  float64
 74  secondary_cleaner.state.floatbank3_a_level          16747 non-null  float64
 75  secondary_cleaner.state.floatbank3_b_air            16752 non-null  float64
 76  secondary_cleaner.state.floatbank3_b_level          16750 non-null  float64
 77  secondary_cleaner.state.floatbank4_a_air            16731 non-null  float64
 78  secondary_cleaner.state.floatbank4_a_level          16747 non-null  float64
 79  secondary_cleaner.state.floatbank4_b_air            16768 non-null  float64
 80  secondary_cleaner.state.floatbank4_b_level          16767 non-null  float64
 81  secondary_cleaner.state.floatbank5_a_air            16775 non-null  float64
 82  secondary_cleaner.state.floatbank5_a_level          16775 non-null  float64
 83  secondary_cleaner.state.floatbank5_b_air            16775 non-null  float64
 84  secondary_cleaner.state.floatbank5_b_level          16776 non-null  float64
 85  secondary_cleaner.state.floatbank6_a_air            16757 non-null  float64
 86  secondary_cleaner.state.floatbank6_a_level          16775 non-null  float64
dtypes: float64(86), object(1)
memory usage: 11.2+ MB

Recovery & MAE Calculations¶

Recovery Formula: simulates the process of recovering gold from gold ore

image.png

C : share of gold in the concentrate

  • right after floatation (finds the rougher concentrate recovery) - [rougher.output.concentrate_au]
  • right after purification (finds the final concentrate recovery) - [final.output.concentrate_au]

F : share of gold in the feed

  • right before floatation (finds rougher) - [rougher.input.feed_au]
  • right after floatation (finds final) - [rougher.output.concentrate_au]

T : share of gold in the rougher tails

  • right after floatation (finds rougher) - [rougher.output.tail_au]
  • right after purification (finds final) - [final.output.tail_au]

image.png

In [5]:
# Check that recovery is calculated correctly, from the training set, using `rougher.output.recovery`

# Define your rougher recovery function
def rougher_recovery(df):
    """Calculate recovery for a given dataframe."""

    # extract needed columns

    c = df['rougher.output.concentrate_au'] 
    f = df['rougher.input.feed_au'] 
    t = df['rougher.output.tail_au'] 
  


    # recovery formula
    rougher_recovery = ((c * (f - t)) / (f * (c - t))) * 100

    return rougher_recovery

display(rougher_recovery(gold_train))

# Define your final recovery function

def final_recovery(df):
    """Calculate recovery for a given dataframe."""

    # extract needed columns

    c1 = df['final.output.concentrate_au']
    f1 = df['rougher.output.concentrate_au']
    t1 = df['final.output.tail_au']


    # recovery formula
    final_recovery = ((c1 * (f1 - t1)) / (f1 * (c1 - t1))) * 100

    return final_recovery

display(final_recovery(gold_train))
0        87.107763
1        86.843261
2        86.842308
3        87.226430
4        86.688794
           ...    
16855    89.574376
16856    87.724007
16857    88.890579
16858    89.858126
16859    89.514960
Length: 16860, dtype: float64
0        93.944554
1        93.790501
2        93.509750
3        93.595268
4        93.811976
           ...    
16855    94.886132
16856    94.507593
16857    92.593426
16858    94.268532
16859    95.048379
Length: 16860, dtype: float64
In [6]:
# Define a function that shows how many correct/True or incorrect/False recovery calculations there are:

def bool_recovery(df):

    boolean = df['rougher.output.recovery'] == rougher_recovery(df)
    bool_count = boolean.value_counts()
    boolean1 = df['final.output.recovery'] == final_recovery(df)
    bool_count1 = boolean1.value_counts()

    return print(f"The Rougher Recovery Calculations Are: \n{bool_count}\n\nThe Final Recovery Calculations Are: \n{bool_count1}")


bool_recovery(gold_train)
    
The Rougher Recovery Calculations Are: 
False    10002
True      6858
dtype: int64

The Final Recovery Calculations Are: 
False    16314
True       546
dtype: int64
In [7]:
# Get the MAE values for the predicted (rougher_recovery()) and the actual (gold_train['rougher.output.recovery']) values.

predicted = rougher_recovery(gold_train)
actual = gold_train['rougher.output.recovery']

display(len(gold_train))
display(predicted.isna().sum())
display(actual.isna().sum())

# Fix the missing values to calculate MAE

# 1. Combine into one DF
pred_act_df = pd.DataFrame({'predicted_values_train': predicted, 'actual_values_train': actual})
display(pred_act_df)

# 2. Drop the NaN values
pred_act_df = pred_act_df.dropna()
display(pred_act_df)

# 3. Compute MAE
mae = mean_absolute_error(pred_act_df['actual_values_train'], pred_act_df['predicted_values_train'])
print(f"MAE for Rougher Recovery: {mae}")
16860
2283
2573
predicted_values_train actual_values_train
0 87.107763 87.107763
1 86.843261 86.843261
2 86.842308 86.842308
3 87.226430 87.226430
4 86.688794 86.688794
... ... ...
16855 89.574376 89.574376
16856 87.724007 87.724007
16857 88.890579 88.890579
16858 89.858126 89.858126
16859 89.514960 89.514960

16860 rows × 2 columns

predicted_values_train actual_values_train
0 87.107763 87.107763
1 86.843261 86.843261
2 86.842308 86.842308
3 87.226430 87.226430
4 86.688794 86.688794
... ... ...
16855 89.574376 89.574376
16856 87.724007 87.724007
16857 88.890579 88.890579
16858 89.858126 89.858126
16859 89.514960 89.514960

14287 rows × 2 columns

MAE for Rougher Recovery: 9.303415616264301e-15
In [8]:
# View the non-equal values
display(pred_act_df[pred_act_df['predicted_values_train'] != pred_act_df['actual_values_train']])
predicted_values_train actual_values_train
1 86.843261 86.843261
2 86.842308 86.842308
5 88.156912 88.156912
6 88.168065 88.168065
8 87.035862 87.035862
... ... ...
16849 91.675070 91.675070
16851 89.946627 89.946627
16854 91.816623 91.816623
16857 88.890579 88.890579
16859 89.514960 89.514960

7429 rows × 2 columns

In [9]:
gold_train['rougher.output.recovery'].isna().sum()
Out[9]:
2573
In [10]:
# View the non-equal number that differ more than 1e^-20
display(np.isclose(pred_act_df['predicted_values_train'], pred_act_df['actual_values_train'], atol=1e-20).sum())
14287

MAE Calculation & Rougher Recovery (Training Set):

Formula Accuracy Check

Metric Value Interpretation
Mean Absolute Error (MAE) 9.3e-15 ~0% - Formula perfectly matches target values
Values that differ 0 All differences are floating-point rounding errors
Values differing by > 1e-20 0 Confirms perfect formula-target alignment

Dataset Overview

Category Count Percentage Notes
Total training rows 16,860 100% Complete dataset size
Valid formula results 14,287 84.7% Rows where formula could be calculated
NaN in formula results 2,283 13.5% Due to missing required input columns
NaN in target values 2,573 15.3% Missing measurements in dataset
Rows after dropping NaNs 14,287 84.7% Final comparison dataset

Key Findings:

Finding Status Impact
Formula Accuracy ✅ Perfect Known formula perfectly reproduces target values
Data Coverage ⚠️ Good 84.7% of data usable for comparison
Missing Data Pattern ℹ️ Expected NaN values are typical in industrial datasets
Formula Reliability ✅ Excellent Zero meaningful calculation errors detected

Summary: The known formula demonstrates perfect accuracy when applied to the feature columns, with calculated values matching the target column within floating-point precision. This validates both the formula correctness and data quality for 84.7% of the dataset.

Examine Unavailable Features¶

In [11]:
# Look at the columns in the test set and the training set to analyze the missing columns in the test set

display(gold_test.info())
print()
display(gold_train.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5856 entries, 0 to 5855
Data columns (total 53 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   date                                        5856 non-null   object 
 1   primary_cleaner.input.sulfate               5554 non-null   float64
 2   primary_cleaner.input.depressant            5572 non-null   float64
 3   primary_cleaner.input.feed_size             5856 non-null   float64
 4   primary_cleaner.input.xanthate              5690 non-null   float64
 5   primary_cleaner.state.floatbank8_a_air      5840 non-null   float64
 6   primary_cleaner.state.floatbank8_a_level    5840 non-null   float64
 7   primary_cleaner.state.floatbank8_b_air      5840 non-null   float64
 8   primary_cleaner.state.floatbank8_b_level    5840 non-null   float64
 9   primary_cleaner.state.floatbank8_c_air      5840 non-null   float64
 10  primary_cleaner.state.floatbank8_c_level    5840 non-null   float64
 11  primary_cleaner.state.floatbank8_d_air      5840 non-null   float64
 12  primary_cleaner.state.floatbank8_d_level    5840 non-null   float64
 13  rougher.input.feed_ag                       5840 non-null   float64
 14  rougher.input.feed_pb                       5840 non-null   float64
 15  rougher.input.feed_rate                     5816 non-null   float64
 16  rougher.input.feed_size                     5834 non-null   float64
 17  rougher.input.feed_sol                      5789 non-null   float64
 18  rougher.input.feed_au                       5840 non-null   float64
 19  rougher.input.floatbank10_sulfate           5599 non-null   float64
 20  rougher.input.floatbank10_xanthate          5733 non-null   float64
 21  rougher.input.floatbank11_sulfate           5801 non-null   float64
 22  rougher.input.floatbank11_xanthate          5503 non-null   float64
 23  rougher.state.floatbank10_a_air             5839 non-null   float64
 24  rougher.state.floatbank10_a_level           5840 non-null   float64
 25  rougher.state.floatbank10_b_air             5839 non-null   float64
 26  rougher.state.floatbank10_b_level           5840 non-null   float64
 27  rougher.state.floatbank10_c_air             5839 non-null   float64
 28  rougher.state.floatbank10_c_level           5840 non-null   float64
 29  rougher.state.floatbank10_d_air             5839 non-null   float64
 30  rougher.state.floatbank10_d_level           5840 non-null   float64
 31  rougher.state.floatbank10_e_air             5839 non-null   float64
 32  rougher.state.floatbank10_e_level           5840 non-null   float64
 33  rougher.state.floatbank10_f_air             5839 non-null   float64
 34  rougher.state.floatbank10_f_level           5840 non-null   float64
 35  secondary_cleaner.state.floatbank2_a_air    5836 non-null   float64
 36  secondary_cleaner.state.floatbank2_a_level  5840 non-null   float64
 37  secondary_cleaner.state.floatbank2_b_air    5833 non-null   float64
 38  secondary_cleaner.state.floatbank2_b_level  5840 non-null   float64
 39  secondary_cleaner.state.floatbank3_a_air    5822 non-null   float64
 40  secondary_cleaner.state.floatbank3_a_level  5840 non-null   float64
 41  secondary_cleaner.state.floatbank3_b_air    5840 non-null   float64
 42  secondary_cleaner.state.floatbank3_b_level  5840 non-null   float64
 43  secondary_cleaner.state.floatbank4_a_air    5840 non-null   float64
 44  secondary_cleaner.state.floatbank4_a_level  5840 non-null   float64
 45  secondary_cleaner.state.floatbank4_b_air    5840 non-null   float64
 46  secondary_cleaner.state.floatbank4_b_level  5840 non-null   float64
 47  secondary_cleaner.state.floatbank5_a_air    5840 non-null   float64
 48  secondary_cleaner.state.floatbank5_a_level  5840 non-null   float64
 49  secondary_cleaner.state.floatbank5_b_air    5840 non-null   float64
 50  secondary_cleaner.state.floatbank5_b_level  5840 non-null   float64
 51  secondary_cleaner.state.floatbank6_a_air    5840 non-null   float64
 52  secondary_cleaner.state.floatbank6_a_level  5840 non-null   float64
dtypes: float64(52), object(1)
memory usage: 2.4+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 87 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   date                                                16860 non-null  object 
 1   final.output.concentrate_ag                         16788 non-null  float64
 2   final.output.concentrate_pb                         16788 non-null  float64
 3   final.output.concentrate_sol                        16490 non-null  float64
 4   final.output.concentrate_au                         16789 non-null  float64
 5   final.output.recovery                               15339 non-null  float64
 6   final.output.tail_ag                                16794 non-null  float64
 7   final.output.tail_pb                                16677 non-null  float64
 8   final.output.tail_sol                               16715 non-null  float64
 9   final.output.tail_au                                16794 non-null  float64
 10  primary_cleaner.input.sulfate                       15553 non-null  float64
 11  primary_cleaner.input.depressant                    15598 non-null  float64
 12  primary_cleaner.input.feed_size                     16860 non-null  float64
 13  primary_cleaner.input.xanthate                      15875 non-null  float64
 14  primary_cleaner.output.concentrate_ag               16778 non-null  float64
 15  primary_cleaner.output.concentrate_pb               16502 non-null  float64
 16  primary_cleaner.output.concentrate_sol              16224 non-null  float64
 17  primary_cleaner.output.concentrate_au               16778 non-null  float64
 18  primary_cleaner.output.tail_ag                      16777 non-null  float64
 19  primary_cleaner.output.tail_pb                      16761 non-null  float64
 20  primary_cleaner.output.tail_sol                     16579 non-null  float64
 21  primary_cleaner.output.tail_au                      16777 non-null  float64
 22  primary_cleaner.state.floatbank8_a_air              16820 non-null  float64
 23  primary_cleaner.state.floatbank8_a_level            16827 non-null  float64
 24  primary_cleaner.state.floatbank8_b_air              16820 non-null  float64
 25  primary_cleaner.state.floatbank8_b_level            16833 non-null  float64
 26  primary_cleaner.state.floatbank8_c_air              16822 non-null  float64
 27  primary_cleaner.state.floatbank8_c_level            16833 non-null  float64
 28  primary_cleaner.state.floatbank8_d_air              16821 non-null  float64
 29  primary_cleaner.state.floatbank8_d_level            16833 non-null  float64
 30  rougher.calculation.sulfate_to_au_concentrate       16833 non-null  float64
 31  rougher.calculation.floatbank10_sulfate_to_au_feed  16833 non-null  float64
 32  rougher.calculation.floatbank11_sulfate_to_au_feed  16833 non-null  float64
 33  rougher.calculation.au_pb_ratio                     15618 non-null  float64
 34  rougher.input.feed_ag                               16778 non-null  float64
 35  rougher.input.feed_pb                               16632 non-null  float64
 36  rougher.input.feed_rate                             16347 non-null  float64
 37  rougher.input.feed_size                             16443 non-null  float64
 38  rougher.input.feed_sol                              16568 non-null  float64
 39  rougher.input.feed_au                               16777 non-null  float64
 40  rougher.input.floatbank10_sulfate                   15816 non-null  float64
 41  rougher.input.floatbank10_xanthate                  16514 non-null  float64
 42  rougher.input.floatbank11_sulfate                   16237 non-null  float64
 43  rougher.input.floatbank11_xanthate                  14956 non-null  float64
 44  rougher.output.concentrate_ag                       16778 non-null  float64
 45  rougher.output.concentrate_pb                       16778 non-null  float64
 46  rougher.output.concentrate_sol                      16698 non-null  float64
 47  rougher.output.concentrate_au                       16778 non-null  float64
 48  rougher.output.recovery                             14287 non-null  float64
 49  rougher.output.tail_ag                              14610 non-null  float64
 50  rougher.output.tail_pb                              16778 non-null  float64
 51  rougher.output.tail_sol                             14611 non-null  float64
 52  rougher.output.tail_au                              14611 non-null  float64
 53  rougher.state.floatbank10_a_air                     16807 non-null  float64
 54  rougher.state.floatbank10_a_level                   16807 non-null  float64
 55  rougher.state.floatbank10_b_air                     16807 non-null  float64
 56  rougher.state.floatbank10_b_level                   16807 non-null  float64
 57  rougher.state.floatbank10_c_air                     16807 non-null  float64
 58  rougher.state.floatbank10_c_level                   16814 non-null  float64
 59  rougher.state.floatbank10_d_air                     16802 non-null  float64
 60  rougher.state.floatbank10_d_level                   16809 non-null  float64
 61  rougher.state.floatbank10_e_air                     16257 non-null  float64
 62  rougher.state.floatbank10_e_level                   16809 non-null  float64
 63  rougher.state.floatbank10_f_air                     16802 non-null  float64
 64  rougher.state.floatbank10_f_level                   16802 non-null  float64
 65  secondary_cleaner.output.tail_ag                    16776 non-null  float64
 66  secondary_cleaner.output.tail_pb                    16764 non-null  float64
 67  secondary_cleaner.output.tail_sol                   14874 non-null  float64
 68  secondary_cleaner.output.tail_au                    16778 non-null  float64
 69  secondary_cleaner.state.floatbank2_a_air            16497 non-null  float64
 70  secondary_cleaner.state.floatbank2_a_level          16751 non-null  float64
 71  secondary_cleaner.state.floatbank2_b_air            16705 non-null  float64
 72  secondary_cleaner.state.floatbank2_b_level          16748 non-null  float64
 73  secondary_cleaner.state.floatbank3_a_air            16763 non-null  float64
 74  secondary_cleaner.state.floatbank3_a_level          16747 non-null  float64
 75  secondary_cleaner.state.floatbank3_b_air            16752 non-null  float64
 76  secondary_cleaner.state.floatbank3_b_level          16750 non-null  float64
 77  secondary_cleaner.state.floatbank4_a_air            16731 non-null  float64
 78  secondary_cleaner.state.floatbank4_a_level          16747 non-null  float64
 79  secondary_cleaner.state.floatbank4_b_air            16768 non-null  float64
 80  secondary_cleaner.state.floatbank4_b_level          16767 non-null  float64
 81  secondary_cleaner.state.floatbank5_a_air            16775 non-null  float64
 82  secondary_cleaner.state.floatbank5_a_level          16775 non-null  float64
 83  secondary_cleaner.state.floatbank5_b_air            16775 non-null  float64
 84  secondary_cleaner.state.floatbank5_b_level          16776 non-null  float64
 85  secondary_cleaner.state.floatbank6_a_air            16757 non-null  float64
 86  secondary_cleaner.state.floatbank6_a_level          16775 non-null  float64
dtypes: float64(86), object(1)
memory usage: 11.2+ MB
None
In [12]:
# Display the number of columns in the training set and test set

display(len(gold_train.columns))
display(len(gold_test.columns))
87
53
In [13]:
# Missing values in the Training Set

with pd.option_context('display.max_rows', None):
    missing_values_train = gold_train.isna().sum()
    display(missing_values_train)
date                                                     0
final.output.concentrate_ag                             72
final.output.concentrate_pb                             72
final.output.concentrate_sol                           370
final.output.concentrate_au                             71
final.output.recovery                                 1521
final.output.tail_ag                                    66
final.output.tail_pb                                   183
final.output.tail_sol                                  145
final.output.tail_au                                    66
primary_cleaner.input.sulfate                         1307
primary_cleaner.input.depressant                      1262
primary_cleaner.input.feed_size                          0
primary_cleaner.input.xanthate                         985
primary_cleaner.output.concentrate_ag                   82
primary_cleaner.output.concentrate_pb                  358
primary_cleaner.output.concentrate_sol                 636
primary_cleaner.output.concentrate_au                   82
primary_cleaner.output.tail_ag                          83
primary_cleaner.output.tail_pb                          99
primary_cleaner.output.tail_sol                        281
primary_cleaner.output.tail_au                          83
primary_cleaner.state.floatbank8_a_air                  40
primary_cleaner.state.floatbank8_a_level                33
primary_cleaner.state.floatbank8_b_air                  40
primary_cleaner.state.floatbank8_b_level                27
primary_cleaner.state.floatbank8_c_air                  38
primary_cleaner.state.floatbank8_c_level                27
primary_cleaner.state.floatbank8_d_air                  39
primary_cleaner.state.floatbank8_d_level                27
rougher.calculation.sulfate_to_au_concentrate           27
rougher.calculation.floatbank10_sulfate_to_au_feed      27
rougher.calculation.floatbank11_sulfate_to_au_feed      27
rougher.calculation.au_pb_ratio                       1242
rougher.input.feed_ag                                   82
rougher.input.feed_pb                                  228
rougher.input.feed_rate                                513
rougher.input.feed_size                                417
rougher.input.feed_sol                                 292
rougher.input.feed_au                                   83
rougher.input.floatbank10_sulfate                     1044
rougher.input.floatbank10_xanthate                     346
rougher.input.floatbank11_sulfate                      623
rougher.input.floatbank11_xanthate                    1904
rougher.output.concentrate_ag                           82
rougher.output.concentrate_pb                           82
rougher.output.concentrate_sol                         162
rougher.output.concentrate_au                           82
rougher.output.recovery                               2573
rougher.output.tail_ag                                2250
rougher.output.tail_pb                                  82
rougher.output.tail_sol                               2249
rougher.output.tail_au                                2249
rougher.state.floatbank10_a_air                         53
rougher.state.floatbank10_a_level                       53
rougher.state.floatbank10_b_air                         53
rougher.state.floatbank10_b_level                       53
rougher.state.floatbank10_c_air                         53
rougher.state.floatbank10_c_level                       46
rougher.state.floatbank10_d_air                         58
rougher.state.floatbank10_d_level                       51
rougher.state.floatbank10_e_air                        603
rougher.state.floatbank10_e_level                       51
rougher.state.floatbank10_f_air                         58
rougher.state.floatbank10_f_level                       58
secondary_cleaner.output.tail_ag                        84
secondary_cleaner.output.tail_pb                        96
secondary_cleaner.output.tail_sol                     1986
secondary_cleaner.output.tail_au                        82
secondary_cleaner.state.floatbank2_a_air               363
secondary_cleaner.state.floatbank2_a_level             109
secondary_cleaner.state.floatbank2_b_air               155
secondary_cleaner.state.floatbank2_b_level             112
secondary_cleaner.state.floatbank3_a_air                97
secondary_cleaner.state.floatbank3_a_level             113
secondary_cleaner.state.floatbank3_b_air               108
secondary_cleaner.state.floatbank3_b_level             110
secondary_cleaner.state.floatbank4_a_air               129
secondary_cleaner.state.floatbank4_a_level             113
secondary_cleaner.state.floatbank4_b_air                92
secondary_cleaner.state.floatbank4_b_level              93
secondary_cleaner.state.floatbank5_a_air                85
secondary_cleaner.state.floatbank5_a_level              85
secondary_cleaner.state.floatbank5_b_air                85
secondary_cleaner.state.floatbank5_b_level              84
secondary_cleaner.state.floatbank6_a_air               103
secondary_cleaner.state.floatbank6_a_level              85
dtype: int64
In [14]:
# Missing values in the Test Set

missing_values_test = gold_test.isna().sum()
In [15]:
# Find the columns not in the test set

missing_test_columns = set(gold_train.columns) - set(gold_test.columns)
missing_test_columns
Out[15]:
{'final.output.concentrate_ag',
 'final.output.concentrate_au',
 'final.output.concentrate_pb',
 'final.output.concentrate_sol',
 'final.output.recovery',
 'final.output.tail_ag',
 'final.output.tail_au',
 'final.output.tail_pb',
 'final.output.tail_sol',
 'primary_cleaner.output.concentrate_ag',
 'primary_cleaner.output.concentrate_au',
 'primary_cleaner.output.concentrate_pb',
 'primary_cleaner.output.concentrate_sol',
 'primary_cleaner.output.tail_ag',
 'primary_cleaner.output.tail_au',
 'primary_cleaner.output.tail_pb',
 'primary_cleaner.output.tail_sol',
 'rougher.calculation.au_pb_ratio',
 'rougher.calculation.floatbank10_sulfate_to_au_feed',
 'rougher.calculation.floatbank11_sulfate_to_au_feed',
 'rougher.calculation.sulfate_to_au_concentrate',
 'rougher.output.concentrate_ag',
 'rougher.output.concentrate_au',
 'rougher.output.concentrate_pb',
 'rougher.output.concentrate_sol',
 'rougher.output.recovery',
 'rougher.output.tail_ag',
 'rougher.output.tail_au',
 'rougher.output.tail_pb',
 'rougher.output.tail_sol',
 'secondary_cleaner.output.tail_ag',
 'secondary_cleaner.output.tail_au',
 'secondary_cleaner.output.tail_pb',
 'secondary_cleaner.output.tail_sol'}
In [16]:
# Confirm the number of missing test columns

len(missing_test_columns)
Out[16]:
34

Training set:

  • Columns: 87
  • Rows: 16860

Test Set:

  • Columns: 53
  • Rows: 5856
  • Missing Columns: 34

Both Sets:

  • DTypes: All float except date (object)
  • Column w/ no NaN Values (2):
    • date
    • primary_cleaner.input.feed_size

Features NOT in the Test Set (34):

  • final.output.concentrate_ag | Output (Final concentrate silver)
  • final.output.concentrate_au | Output (Final concentrate gold)
  • final.output.concentrate_pb | Output (Final concentrate lead)
  • final.output.concentrate_sol | Output (Final concentrate solid)
  • final.output.recovery | Target (Final recovery target)
  • final.output.tail_ag | Output (Final tailings silver)
  • final.output.tail_au | Output (Final tailings gold)
  • final.output.tail_pb | Output (Final tailings lead)
  • final.output.tail_sol | Output (Final tailings solid)
  • primary_cleaner.output.concentrate_ag | Output (Primary cleaner concentrate silver)
  • primary_cleaner.output.concentrate_au | Output (Primary cleaner concentrate gold)
  • primary_cleaner.output.concentrate_pb | Output (Primary cleaner concentrate lead)
  • primary_cleaner.output.concentrate_sol | Output (Primary cleaner concentrate solid)
  • primary_cleaner.output.tail_ag | Output (Primary cleaner tailings silver)
  • primary_cleaner.output.tail_au | Output (Primary cleaner tailings gold)
  • primary_cleaner.output.tail_pb | Output (Primary cleaner tailings lead)
  • primary_cleaner.output.tail_sol | Output (Primary cleaner tailings solid)
  • rougher.calculation.au_pb_ratio | Calculations (Gold to lead ratio - data leakage)
  • rougher.calculation.floatbank10_sulfate_to_au_feed | Calculations (Floatbank10 sulfate to gold feed ratio - data leakage)
  • rougher.calculation.floatbank11_sulfate_to_au_feed | Calculations (Floatbank11 sulfate to gold feed ratio - data leakage)
  • rougher.calculation.sulfate_to_au_concentrate | Calculations (Sulfate to gold concentrate ratio - data leakage)
  • rougher.output.concentrate_ag | Output (Rougher concentrate silver)
  • rougher.output.concentrate_au | Output (Rougher concentrate gold)
  • rougher.output.concentrate_pb | Output (Rougher concentrate lead)
  • rougher.output.concentrate_sol | Output (Rougher concentrate solid)
  • rougher.output.recovery | Target (Rougher recovery target)
  • rougher.output.tail_ag | Output (Rougher tailings silver)
  • rougher.output.tail_au | Output (Rougher tailings gold)
  • rougher.output.tail_pb | Output (Rougher tailings lead)
  • rougher.output.tail_sol | Output (Rougher tailings solid)
  • secondary_cleaner.output.tail_ag | Output (Secondary cleaner tailings silver)
  • secondary_cleaner.output.tail_au | Output (Secondary cleaner tailings gold)
  • secondary_cleaner.output.tail_pb | Output (Secondary cleaner tailings lead)
  • secondary_cleaner.output.tail_sol | Output (Secondary cleaner tailings solid)

Summary by Parameter Type:

Parameter Type Count Reason for Exclusion
Output 28 Features only known after processing, not available at prediction time
Target 2 Used for stage-specific predictions
Calculations 4 Dependent on outputs/targets → potential data leakage

Note: All Float

Process Data¶

Known from previous code:

  • Fix date to datetime dtype (all datasets)
  • Entries:
    • Train: 16,860 x 87
    • Test: 5,856 x 53
    • Full: 22,716 x 87

Handle:

  • Missing Data
  • Duplicate Data
In [17]:
# Missing Data

# Training Set
missing_values_train

# Test Set
missing_values_test

# Full Dataset
missing_values_full = gold_full.isna().sum()
In [18]:
# Full Set missing data
with pd.option_context('display.max_rows', None):
    display(missing_values_full)
date                                                     0
final.output.concentrate_ag                             89
final.output.concentrate_pb                             87
final.output.concentrate_sol                           385
final.output.concentrate_au                             86
final.output.recovery                                 1963
final.output.tail_ag                                    83
final.output.tail_pb                                   200
final.output.tail_sol                                  271
final.output.tail_au                                    81
primary_cleaner.input.sulfate                         1609
primary_cleaner.input.depressant                      1546
primary_cleaner.input.feed_size                          0
primary_cleaner.input.xanthate                        1151
primary_cleaner.output.concentrate_ag                   98
primary_cleaner.output.concentrate_pb                  448
primary_cleaner.output.concentrate_sol                 798
primary_cleaner.output.concentrate_au                   98
primary_cleaner.output.tail_ag                         102
primary_cleaner.output.tail_pb                         122
primary_cleaner.output.tail_sol                        351
primary_cleaner.output.tail_au                          99
primary_cleaner.state.floatbank8_a_air                  56
primary_cleaner.state.floatbank8_a_level                49
primary_cleaner.state.floatbank8_b_air                  56
primary_cleaner.state.floatbank8_b_level                43
primary_cleaner.state.floatbank8_c_air                  54
primary_cleaner.state.floatbank8_c_level                43
primary_cleaner.state.floatbank8_d_air                  55
primary_cleaner.state.floatbank8_d_level                43
rougher.calculation.sulfate_to_au_concentrate           44
rougher.calculation.floatbank10_sulfate_to_au_feed      44
rougher.calculation.floatbank11_sulfate_to_au_feed      44
rougher.calculation.au_pb_ratio                       1627
rougher.input.feed_ag                                   98
rougher.input.feed_pb                                  244
rougher.input.feed_rate                                553
rougher.input.feed_size                                439
rougher.input.feed_sol                                 359
rougher.input.feed_au                                   99
rougher.input.floatbank10_sulfate                     1301
rougher.input.floatbank10_xanthate                     469
rougher.input.floatbank11_sulfate                      678
rougher.input.floatbank11_xanthate                    2257
rougher.output.concentrate_ag                           98
rougher.output.concentrate_pb                           98
rougher.output.concentrate_sol                         190
rougher.output.concentrate_au                           98
rougher.output.recovery                               3119
rougher.output.tail_ag                                2737
rougher.output.tail_pb                                  98
rougher.output.tail_sol                               2736
rougher.output.tail_au                                2736
rougher.state.floatbank10_a_air                         70
rougher.state.floatbank10_a_level                       69
rougher.state.floatbank10_b_air                         70
rougher.state.floatbank10_b_level                       69
rougher.state.floatbank10_c_air                         70
rougher.state.floatbank10_c_level                       62
rougher.state.floatbank10_d_air                         75
rougher.state.floatbank10_d_level                       67
rougher.state.floatbank10_e_air                        620
rougher.state.floatbank10_e_level                       67
rougher.state.floatbank10_f_air                         75
rougher.state.floatbank10_f_level                       74
secondary_cleaner.output.tail_ag                       100
secondary_cleaner.output.tail_pb                       116
secondary_cleaner.output.tail_sol                     2215
secondary_cleaner.output.tail_au                        98
secondary_cleaner.state.floatbank2_a_air               383
secondary_cleaner.state.floatbank2_a_level             125
secondary_cleaner.state.floatbank2_b_air               178
secondary_cleaner.state.floatbank2_b_level             128
secondary_cleaner.state.floatbank3_a_air               131
secondary_cleaner.state.floatbank3_a_level             129
secondary_cleaner.state.floatbank3_b_air               124
secondary_cleaner.state.floatbank3_b_level             126
secondary_cleaner.state.floatbank4_a_air               145
secondary_cleaner.state.floatbank4_a_level             129
secondary_cleaner.state.floatbank4_b_air               108
secondary_cleaner.state.floatbank4_b_level             109
secondary_cleaner.state.floatbank5_a_air               101
secondary_cleaner.state.floatbank5_a_level             101
secondary_cleaner.state.floatbank5_b_air               101
secondary_cleaner.state.floatbank5_b_level             100
secondary_cleaner.state.floatbank6_a_air               119
secondary_cleaner.state.floatbank6_a_level             101
dtype: int64
In [19]:
# Get the percent & count of missing values for each column (Full Dataset)
percent_missing_full = (missing_values_full.values / 22716) * 100

missing_values_full_df = pd.DataFrame(missing_values_full, columns=['miss_cnt'])
percent_missing_full_df = pd.DataFrame(percent_missing_full, 
                                       index=missing_values_full.index, 
                                       columns=['pct_missing'])

missing_values_full_copy = missing_values_full_df.copy()

missing_values_full_copy = pd.concat([missing_values_full_copy, percent_missing_full_df],axis = 1)
missing_cp_full = missing_values_full_copy
In [20]:
# Get the percent & count of missing values for each column (Training set)
percent_missing_train = (missing_values_train.values / 16860) * 100

missing_values_train_df = pd.DataFrame(missing_values_train, columns = ['miss_cnt_train'])
percent_missing_train_df = pd.DataFrame(percent_missing_train,
                                        index = missing_values_train.index,
                                       columns = ['pct_missing_train'])

missing_cp_train = pd.concat([missing_values_train_df,percent_missing_train_df],axis = 1)
missing_cp_train
Out[20]:
miss_cnt_train pct_missing_train
date 0 0.000000
final.output.concentrate_ag 72 0.427046
final.output.concentrate_pb 72 0.427046
final.output.concentrate_sol 370 2.194543
final.output.concentrate_au 71 0.421115
... ... ...
secondary_cleaner.state.floatbank5_a_level 85 0.504152
secondary_cleaner.state.floatbank5_b_air 85 0.504152
secondary_cleaner.state.floatbank5_b_level 84 0.498221
secondary_cleaner.state.floatbank6_a_air 103 0.610913
secondary_cleaner.state.floatbank6_a_level 85 0.504152

87 rows × 2 columns

In [21]:
# Get the percent & count of missing values for each column (Test set)
percent_missing_test = (missing_values_test.values / 5836) * 100

missing_values_test_df = pd.DataFrame(missing_values_test, columns = ['miss_cnt_test'])
percent_missing_test_df = pd.DataFrame(percent_missing_test,
                                      index = missing_values_test.index,
                                      columns = ['pct_missing_test'])

missing_cp_test = pd.concat([missing_values_test_df,percent_missing_test_df], axis = 1)
display(missing_cp_test)
miss_cnt_test pct_missing_test
date 0 0.000000
primary_cleaner.input.sulfate 302 5.174777
primary_cleaner.input.depressant 284 4.866347
primary_cleaner.input.feed_size 0 0.000000
primary_cleaner.input.xanthate 166 2.844414
primary_cleaner.state.floatbank8_a_air 16 0.274160
primary_cleaner.state.floatbank8_a_level 16 0.274160
primary_cleaner.state.floatbank8_b_air 16 0.274160
primary_cleaner.state.floatbank8_b_level 16 0.274160
primary_cleaner.state.floatbank8_c_air 16 0.274160
primary_cleaner.state.floatbank8_c_level 16 0.274160
primary_cleaner.state.floatbank8_d_air 16 0.274160
primary_cleaner.state.floatbank8_d_level 16 0.274160
rougher.input.feed_ag 16 0.274160
rougher.input.feed_pb 16 0.274160
rougher.input.feed_rate 40 0.685401
rougher.input.feed_size 22 0.376971
rougher.input.feed_sol 67 1.148047
rougher.input.feed_au 16 0.274160
rougher.input.floatbank10_sulfate 257 4.403701
rougher.input.floatbank10_xanthate 123 2.107608
rougher.input.floatbank11_sulfate 55 0.942426
rougher.input.floatbank11_xanthate 353 6.048663
rougher.state.floatbank10_a_air 17 0.291295
rougher.state.floatbank10_a_level 16 0.274160
rougher.state.floatbank10_b_air 17 0.291295
rougher.state.floatbank10_b_level 16 0.274160
rougher.state.floatbank10_c_air 17 0.291295
rougher.state.floatbank10_c_level 16 0.274160
rougher.state.floatbank10_d_air 17 0.291295
rougher.state.floatbank10_d_level 16 0.274160
rougher.state.floatbank10_e_air 17 0.291295
rougher.state.floatbank10_e_level 16 0.274160
rougher.state.floatbank10_f_air 17 0.291295
rougher.state.floatbank10_f_level 16 0.274160
secondary_cleaner.state.floatbank2_a_air 20 0.342700
secondary_cleaner.state.floatbank2_a_level 16 0.274160
secondary_cleaner.state.floatbank2_b_air 23 0.394106
secondary_cleaner.state.floatbank2_b_level 16 0.274160
secondary_cleaner.state.floatbank3_a_air 34 0.582591
secondary_cleaner.state.floatbank3_a_level 16 0.274160
secondary_cleaner.state.floatbank3_b_air 16 0.274160
secondary_cleaner.state.floatbank3_b_level 16 0.274160
secondary_cleaner.state.floatbank4_a_air 16 0.274160
secondary_cleaner.state.floatbank4_a_level 16 0.274160
secondary_cleaner.state.floatbank4_b_air 16 0.274160
secondary_cleaner.state.floatbank4_b_level 16 0.274160
secondary_cleaner.state.floatbank5_a_air 16 0.274160
secondary_cleaner.state.floatbank5_a_level 16 0.274160
secondary_cleaner.state.floatbank5_b_air 16 0.274160
secondary_cleaner.state.floatbank5_b_level 16 0.274160
secondary_cleaner.state.floatbank6_a_air 16 0.274160
secondary_cleaner.state.floatbank6_a_level 16 0.274160
In [22]:
# Compare missing percentages
combined_missing = pd.concat([missing_cp_full,missing_cp_train, missing_cp_test], axis = 1)

combined_missing = combined_missing.round(3)

with pd.option_context('display.max_rows', None):
    display(combined_missing)
miss_cnt pct_missing miss_cnt_train pct_missing_train miss_cnt_test pct_missing_test
date 0 0.000 0 0.000 0.0 0.000
final.output.concentrate_ag 89 0.392 72 0.427 NaN NaN
final.output.concentrate_pb 87 0.383 72 0.427 NaN NaN
final.output.concentrate_sol 385 1.695 370 2.195 NaN NaN
final.output.concentrate_au 86 0.379 71 0.421 NaN NaN
final.output.recovery 1963 8.641 1521 9.021 NaN NaN
final.output.tail_ag 83 0.365 66 0.391 NaN NaN
final.output.tail_pb 200 0.880 183 1.085 NaN NaN
final.output.tail_sol 271 1.193 145 0.860 NaN NaN
final.output.tail_au 81 0.357 66 0.391 NaN NaN
primary_cleaner.input.sulfate 1609 7.083 1307 7.752 302.0 5.175
primary_cleaner.input.depressant 1546 6.806 1262 7.485 284.0 4.866
primary_cleaner.input.feed_size 0 0.000 0 0.000 0.0 0.000
primary_cleaner.input.xanthate 1151 5.067 985 5.842 166.0 2.844
primary_cleaner.output.concentrate_ag 98 0.431 82 0.486 NaN NaN
primary_cleaner.output.concentrate_pb 448 1.972 358 2.123 NaN NaN
primary_cleaner.output.concentrate_sol 798 3.513 636 3.772 NaN NaN
primary_cleaner.output.concentrate_au 98 0.431 82 0.486 NaN NaN
primary_cleaner.output.tail_ag 102 0.449 83 0.492 NaN NaN
primary_cleaner.output.tail_pb 122 0.537 99 0.587 NaN NaN
primary_cleaner.output.tail_sol 351 1.545 281 1.667 NaN NaN
primary_cleaner.output.tail_au 99 0.436 83 0.492 NaN NaN
primary_cleaner.state.floatbank8_a_air 56 0.247 40 0.237 16.0 0.274
primary_cleaner.state.floatbank8_a_level 49 0.216 33 0.196 16.0 0.274
primary_cleaner.state.floatbank8_b_air 56 0.247 40 0.237 16.0 0.274
primary_cleaner.state.floatbank8_b_level 43 0.189 27 0.160 16.0 0.274
primary_cleaner.state.floatbank8_c_air 54 0.238 38 0.225 16.0 0.274
primary_cleaner.state.floatbank8_c_level 43 0.189 27 0.160 16.0 0.274
primary_cleaner.state.floatbank8_d_air 55 0.242 39 0.231 16.0 0.274
primary_cleaner.state.floatbank8_d_level 43 0.189 27 0.160 16.0 0.274
rougher.calculation.sulfate_to_au_concentrate 44 0.194 27 0.160 NaN NaN
rougher.calculation.floatbank10_sulfate_to_au_feed 44 0.194 27 0.160 NaN NaN
rougher.calculation.floatbank11_sulfate_to_au_feed 44 0.194 27 0.160 NaN NaN
rougher.calculation.au_pb_ratio 1627 7.162 1242 7.367 NaN NaN
rougher.input.feed_ag 98 0.431 82 0.486 16.0 0.274
rougher.input.feed_pb 244 1.074 228 1.352 16.0 0.274
rougher.input.feed_rate 553 2.434 513 3.043 40.0 0.685
rougher.input.feed_size 439 1.933 417 2.473 22.0 0.377
rougher.input.feed_sol 359 1.580 292 1.732 67.0 1.148
rougher.input.feed_au 99 0.436 83 0.492 16.0 0.274
rougher.input.floatbank10_sulfate 1301 5.727 1044 6.192 257.0 4.404
rougher.input.floatbank10_xanthate 469 2.065 346 2.052 123.0 2.108
rougher.input.floatbank11_sulfate 678 2.985 623 3.695 55.0 0.942
rougher.input.floatbank11_xanthate 2257 9.936 1904 11.293 353.0 6.049
rougher.output.concentrate_ag 98 0.431 82 0.486 NaN NaN
rougher.output.concentrate_pb 98 0.431 82 0.486 NaN NaN
rougher.output.concentrate_sol 190 0.836 162 0.961 NaN NaN
rougher.output.concentrate_au 98 0.431 82 0.486 NaN NaN
rougher.output.recovery 3119 13.730 2573 15.261 NaN NaN
rougher.output.tail_ag 2737 12.049 2250 13.345 NaN NaN
rougher.output.tail_pb 98 0.431 82 0.486 NaN NaN
rougher.output.tail_sol 2736 12.044 2249 13.339 NaN NaN
rougher.output.tail_au 2736 12.044 2249 13.339 NaN NaN
rougher.state.floatbank10_a_air 70 0.308 53 0.314 17.0 0.291
rougher.state.floatbank10_a_level 69 0.304 53 0.314 16.0 0.274
rougher.state.floatbank10_b_air 70 0.308 53 0.314 17.0 0.291
rougher.state.floatbank10_b_level 69 0.304 53 0.314 16.0 0.274
rougher.state.floatbank10_c_air 70 0.308 53 0.314 17.0 0.291
rougher.state.floatbank10_c_level 62 0.273 46 0.273 16.0 0.274
rougher.state.floatbank10_d_air 75 0.330 58 0.344 17.0 0.291
rougher.state.floatbank10_d_level 67 0.295 51 0.302 16.0 0.274
rougher.state.floatbank10_e_air 620 2.729 603 3.577 17.0 0.291
rougher.state.floatbank10_e_level 67 0.295 51 0.302 16.0 0.274
rougher.state.floatbank10_f_air 75 0.330 58 0.344 17.0 0.291
rougher.state.floatbank10_f_level 74 0.326 58 0.344 16.0 0.274
secondary_cleaner.output.tail_ag 100 0.440 84 0.498 NaN NaN
secondary_cleaner.output.tail_pb 116 0.511 96 0.569 NaN NaN
secondary_cleaner.output.tail_sol 2215 9.751 1986 11.779 NaN NaN
secondary_cleaner.output.tail_au 98 0.431 82 0.486 NaN NaN
secondary_cleaner.state.floatbank2_a_air 383 1.686 363 2.153 20.0 0.343
secondary_cleaner.state.floatbank2_a_level 125 0.550 109 0.647 16.0 0.274
secondary_cleaner.state.floatbank2_b_air 178 0.784 155 0.919 23.0 0.394
secondary_cleaner.state.floatbank2_b_level 128 0.563 112 0.664 16.0 0.274
secondary_cleaner.state.floatbank3_a_air 131 0.577 97 0.575 34.0 0.583
secondary_cleaner.state.floatbank3_a_level 129 0.568 113 0.670 16.0 0.274
secondary_cleaner.state.floatbank3_b_air 124 0.546 108 0.641 16.0 0.274
secondary_cleaner.state.floatbank3_b_level 126 0.555 110 0.652 16.0 0.274
secondary_cleaner.state.floatbank4_a_air 145 0.638 129 0.765 16.0 0.274
secondary_cleaner.state.floatbank4_a_level 129 0.568 113 0.670 16.0 0.274
secondary_cleaner.state.floatbank4_b_air 108 0.475 92 0.546 16.0 0.274
secondary_cleaner.state.floatbank4_b_level 109 0.480 93 0.552 16.0 0.274
secondary_cleaner.state.floatbank5_a_air 101 0.445 85 0.504 16.0 0.274
secondary_cleaner.state.floatbank5_a_level 101 0.445 85 0.504 16.0 0.274
secondary_cleaner.state.floatbank5_b_air 101 0.445 85 0.504 16.0 0.274
secondary_cleaner.state.floatbank5_b_level 100 0.440 84 0.498 16.0 0.274
secondary_cleaner.state.floatbank6_a_air 119 0.524 103 0.611 16.0 0.274
secondary_cleaner.state.floatbank6_a_level 101 0.445 85 0.504 16.0 0.274
In [23]:
# Using this line of code to find percentage in specific categories <1%, 1-<5%, and so on

with pd.option_context('display.max_rows', None):
    display(combined_missing[combined_missing[['pct_missing','pct_missing_train', 'pct_missing_test']] >= 15.0])
miss_cnt pct_missing miss_cnt_train pct_missing_train miss_cnt_test pct_missing_test
date NaN NaN NaN NaN NaN NaN
final.output.concentrate_ag NaN NaN NaN NaN NaN NaN
final.output.concentrate_pb NaN NaN NaN NaN NaN NaN
final.output.concentrate_sol NaN NaN NaN NaN NaN NaN
final.output.concentrate_au NaN NaN NaN NaN NaN NaN
final.output.recovery NaN NaN NaN NaN NaN NaN
final.output.tail_ag NaN NaN NaN NaN NaN NaN
final.output.tail_pb NaN NaN NaN NaN NaN NaN
final.output.tail_sol NaN NaN NaN NaN NaN NaN
final.output.tail_au NaN NaN NaN NaN NaN NaN
primary_cleaner.input.sulfate NaN NaN NaN NaN NaN NaN
primary_cleaner.input.depressant NaN NaN NaN NaN NaN NaN
primary_cleaner.input.feed_size NaN NaN NaN NaN NaN NaN
primary_cleaner.input.xanthate NaN NaN NaN NaN NaN NaN
primary_cleaner.output.concentrate_ag NaN NaN NaN NaN NaN NaN
primary_cleaner.output.concentrate_pb NaN NaN NaN NaN NaN NaN
primary_cleaner.output.concentrate_sol NaN NaN NaN NaN NaN NaN
primary_cleaner.output.concentrate_au NaN NaN NaN NaN NaN NaN
primary_cleaner.output.tail_ag NaN NaN NaN NaN NaN NaN
primary_cleaner.output.tail_pb NaN NaN NaN NaN NaN NaN
primary_cleaner.output.tail_sol NaN NaN NaN NaN NaN NaN
primary_cleaner.output.tail_au NaN NaN NaN NaN NaN NaN
primary_cleaner.state.floatbank8_a_air NaN NaN NaN NaN NaN NaN
primary_cleaner.state.floatbank8_a_level NaN NaN NaN NaN NaN NaN
primary_cleaner.state.floatbank8_b_air NaN NaN NaN NaN NaN NaN
primary_cleaner.state.floatbank8_b_level NaN NaN NaN NaN NaN NaN
primary_cleaner.state.floatbank8_c_air NaN NaN NaN NaN NaN NaN
primary_cleaner.state.floatbank8_c_level NaN NaN NaN NaN NaN NaN
primary_cleaner.state.floatbank8_d_air NaN NaN NaN NaN NaN NaN
primary_cleaner.state.floatbank8_d_level NaN NaN NaN NaN NaN NaN
rougher.calculation.sulfate_to_au_concentrate NaN NaN NaN NaN NaN NaN
rougher.calculation.floatbank10_sulfate_to_au_feed NaN NaN NaN NaN NaN NaN
rougher.calculation.floatbank11_sulfate_to_au_feed NaN NaN NaN NaN NaN NaN
rougher.calculation.au_pb_ratio NaN NaN NaN NaN NaN NaN
rougher.input.feed_ag NaN NaN NaN NaN NaN NaN
rougher.input.feed_pb NaN NaN NaN NaN NaN NaN
rougher.input.feed_rate NaN NaN NaN NaN NaN NaN
rougher.input.feed_size NaN NaN NaN NaN NaN NaN
rougher.input.feed_sol NaN NaN NaN NaN NaN NaN
rougher.input.feed_au NaN NaN NaN NaN NaN NaN
rougher.input.floatbank10_sulfate NaN NaN NaN NaN NaN NaN
rougher.input.floatbank10_xanthate NaN NaN NaN NaN NaN NaN
rougher.input.floatbank11_sulfate NaN NaN NaN NaN NaN NaN
rougher.input.floatbank11_xanthate NaN NaN NaN NaN NaN NaN
rougher.output.concentrate_ag NaN NaN NaN NaN NaN NaN
rougher.output.concentrate_pb NaN NaN NaN NaN NaN NaN
rougher.output.concentrate_sol NaN NaN NaN NaN NaN NaN
rougher.output.concentrate_au NaN NaN NaN NaN NaN NaN
rougher.output.recovery NaN NaN NaN 15.261 NaN NaN
rougher.output.tail_ag NaN NaN NaN NaN NaN NaN
rougher.output.tail_pb NaN NaN NaN NaN NaN NaN
rougher.output.tail_sol NaN NaN NaN NaN NaN NaN
rougher.output.tail_au NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_a_air NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_a_level NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_b_air NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_b_level NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_c_air NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_c_level NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_d_air NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_d_level NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_e_air NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_e_level NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_f_air NaN NaN NaN NaN NaN NaN
rougher.state.floatbank10_f_level NaN NaN NaN NaN NaN NaN
secondary_cleaner.output.tail_ag NaN NaN NaN NaN NaN NaN
secondary_cleaner.output.tail_pb NaN NaN NaN NaN NaN NaN
secondary_cleaner.output.tail_sol NaN NaN NaN NaN NaN NaN
secondary_cleaner.output.tail_au NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank2_a_air NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank2_a_level NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank2_b_air NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank2_b_level NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank3_a_air NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank3_a_level NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank3_b_air NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank3_b_level NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank4_a_air NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank4_a_level NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank4_b_air NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank4_b_level NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank5_a_air NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank5_a_level NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank5_b_air NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank5_b_level NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank6_a_air NaN NaN NaN NaN NaN NaN
secondary_cleaner.state.floatbank6_a_level NaN NaN NaN NaN NaN NaN

MISSING DATA ANALYSIS SUMMARY

Percentage of Missing Data by Severity Level

< 1% Missing Data

Shared (Full / Train / Test)
Feature Full % Train % Test %
date 0.000 0.000 0.000
primary_cleaner.input.feed_size 0.000 0.000 0.000
primary_cleaner.state.floatbank8_a_air 0.247 0.237 0.274
primary_cleaner.state.floatbank8_a_level 0.216 0.196 0.274
primary_cleaner.state.floatbank8_b_air 0.247 0.237 0.274
primary_cleaner.state.floatbank8_b_level 0.189 0.160 0.274
primary_cleaner.state.floatbank8_c_air 0.238 0.225 0.274
primary_cleaner.state.floatbank8_c_level 0.189 0.160 0.274
primary_cleaner.state.floatbank8_d_air 0.242 0.231 0.274
primary_cleaner.state.floatbank8_d_level 0.189 0.160 0.274
rougher.input.feed_ag 0.431 0.486 0.274
rougher.input.feed_au 0.436 0.492 0.274
rougher.state.floatbank10_a_air 0.308 0.314 0.291
rougher.state.floatbank10_a_level 0.304 0.314 0.274
rougher.state.floatbank10_b_air 0.308 0.314 0.291
rougher.state.floatbank10_b_level 0.304 0.314 0.274
rougher.state.floatbank10_c_air 0.308 0.314 0.291
rougher.state.floatbank10_c_level 0.273 0.273 0.274
rougher.state.floatbank10_d_air 0.330 0.344 0.291
rougher.state.floatbank10_d_level 0.295 0.302 0.274
rougher.state.floatbank10_e_level 0.295 0.302 0.274
rougher.state.floatbank10_f_air 0.330 0.344 0.291
rougher.state.floatbank10_f_level 0.326 0.344 0.274
secondary_cleaner.state.floatbank2_a_level 0.550 0.647 0.274
secondary_cleaner.state.floatbank2_b_air 0.784 0.919 0.394
secondary_cleaner.state.floatbank2_b_level 0.563 0.664 0.274
secondary_cleaner.state.floatbank3_a_air 0.577 0.575 0.583
secondary_cleaner.state.floatbank3_a_level 0.568 0.670 0.274
secondary_cleaner.state.floatbank3_b_air 0.546 0.641 0.274
secondary_cleaner.state.floatbank3_b_level 0.555 0.652 0.274
secondary_cleaner.state.floatbank4_a_air 0.638 0.765 0.274
secondary_cleaner.state.floatbank4_a_level 0.568 0.670 0.274
secondary_cleaner.state.floatbank4_b_air 0.475 0.546 0.274
secondary_cleaner.state.floatbank4_b_level 0.480 0.552 0.274
secondary_cleaner.state.floatbank5_a_air 0.445 0.504 0.274
secondary_cleaner.state.floatbank5_a_level 0.445 0.504 0.274
secondary_cleaner.state.floatbank5_b_air 0.445 0.504 0.274
secondary_cleaner.state.floatbank5_b_level 0.440 0.498 0.274
secondary_cleaner.state.floatbank6_a_air 0.524 0.611 0.274
secondary_cleaner.state.floatbank6_a_level 0.445 0.504 0.274
Full & Train Only
Feature Full % Train %
final.output.concentrate_ag 0.392 0.427
final.output.concentrate_pb 0.383 0.427
final.output.concentrate_au 0.379 0.421
final.output.tail_ag 0.365 0.391
final.output.tail_au 0.357 0.391
primary_cleaner.output.concentrate_ag 0.431 0.486
primary_cleaner.output.concentrate_au 0.431 0.486
primary_cleaner.output.tail_ag 0.449 0.492
primary_cleaner.output.tail_pb 0.537 0.587
primary_cleaner.output.tail_au 0.436 0.492
rougher.calculation.sulfate_to_au_concentrate 0.194 0.160
rougher.calculation.floatbank10_sulfate_to_au_feed 0.194 0.160
rougher.calculation.floatbank11_sulfate_to_au_feed 0.194 0.160
rougher.output.concentrate_ag 0.431 0.486
rougher.output.concentrate_pb 0.431 0.486
rougher.output.concentrate_sol 0.836 0.961
rougher.output.concentrate_au 0.431 0.486
rougher.output.tail_pb 0.431 0.486
secondary_cleaner.output.tail_ag 0.440 0.498
secondary_cleaner.output.tail_pb 0.511 0.569
secondary_cleaner.output.tail_au 0.431 0.486
Dataset-Specific Features
Category Feature Percentage
Full Only final.output.tail_pb 0.880
Train Only final.output.tail_sol 0.860
Test Only rougher.input.feed_pb 0.274
Test Only rougher.input.feed_rate 0.685
Test Only rougher.input.feed_size 0.377
Test Only rougher.input.floatbank11_sulfate 0.942
Test Only rougher.state.floatbank10_e_air 0.291
Test Only secondary_cleaner.state.floatbank2_a_air 0.343

1% - < 5% Missing Data

Shared (Full / Train / Test)
Feature Full % Train % Test %
rougher.input.feed_sol 1.580 1.732 1.148
rougher.input.floatbank10_xanthate 2.065 2.052 2.108
Full & Train Only
Feature Full % Train %
final.output.concentrate_sol 1.695 2.195
primary_cleaner.output.concentrate_pb 1.972 2.123
primary_cleaner.output.concentrate_sol 3.513 3.772
primary_cleaner.output.tail_sol 1.545 1.667
rougher.input.feed_pb 1.074 1.352
rougher.input.feed_rate 2.434 3.043
rougher.input.feed_size 1.933 2.473
rougher.input.floatbank11_sulfate 2.985 3.695
rougher.state.floatbank10_e_air 2.729 3.577
secondary_cleaner.state.floatbank2_a_air 1.686 2.153
Dataset-Specific Features
Category Feature Percentage
Full Only final.output.tail_sol 1.193
Train Only final.output.tail_pb 1.085
Test Only primary_cleaner.input.depressant 4.866
Test Only primary_cleaner.input.xanthate 2.844
Test Only rougher.input.floatbank10_sulfate 4.404

5% - < 10% Missing Data

Shared (Full / Train / Test)
Feature Full % Train % Test %
primary_cleaner.input.sulfate 7.083 7.752 5.175
Full & Train Only
Feature Full % Train %
final.output.recovery 8.641 9.021
primary_cleaner.input.depressant 6.806 7.485
primary_cleaner.input.xanthate 5.067 5.842
rougher.calculation.au_pb_ratio 7.162 7.367
rougher.input.floatbank10_sulfate 5.727 6.192
Full & Test Only
Feature Full % Test %
rougher.input.floatbank11_xanthate 9.936 6.049
Dataset-Specific Features
Category Feature Percentage
Full Only secondary_cleaner.output.tail_sol 9.751

10% - < 15% Missing Data

Full & Train Only
Feature Full % Train %
rougher.output.tail_ag 12.049 13.345
rougher.output.tail_sol 12.044 13.339
rougher.output.tail_au 12.044 13.339
Dataset-Specific Features
Category Feature Percentage
Full Only rougher.output.recovery 13.730
Train Only rougher.input.floatbank11_xanthate 11.293
Train Only secondary_cleaner.output.tail_sol 11.779

≥ 15% Missing Data

Dataset-Specific Features
Category Feature Percentage
Train Only rougher.output.recovery 15.261

Summary Statistics

Dataset Total Features Features with Missing Data Complete Features
Full Dataset 87 85 2
Training Set 87 85 2
Test Set 53 51 2
Missing bins All Full Train Test Full + Train Full + Test
< 1% 40 1 1 6 21 –
1 – < 5% 2 1 1 3 10 –
5 – < 10% 1 1 – – 5 1
10 – < 15% – 1 2 – 3 –
≥ 15% – – – 1 – –
Total 43 4 5 9 39 1

General Threshold Meaning

  • < 1% : Negligible (Imputaion almost never necessary)
  • 1 - < 5% : Minor (Imputation sometimes necessary)
  • 5 - < 10% : Intermediate (Imputation usually necessary)
  • 10 - < 15% : High (Imputation often necessary)
  • ≥ 15% : Extremely High (Imputation almost always necessary)

Note: Features marked as "Full & Train Only" or some "Full Only" features do not exist in the test set as they are output/target/calculation features excluded for prediction tasks.

MISSING DATA BY DATASET AND SEVERITY LEVEL

Full Dataset Missing Data Distribution

< 1% Missing Data (62 features)

Feature Percentage
date 0.000
primary_cleaner.input.feed_size 0.000
primary_cleaner.state.floatbank8_b_level 0.189
primary_cleaner.state.floatbank8_c_level 0.189
primary_cleaner.state.floatbank8_d_level 0.189
rougher.calculation.sulfate_to_au_concentrate 0.194
rougher.calculation.floatbank10_sulfate_to_au_feed 0.194
rougher.calculation.floatbank11_sulfate_to_au_feed 0.194
primary_cleaner.state.floatbank8_a_level 0.216
primary_cleaner.state.floatbank8_c_air 0.238
primary_cleaner.state.floatbank8_d_air 0.242
primary_cleaner.state.floatbank8_a_air 0.247
primary_cleaner.state.floatbank8_b_air 0.247
rougher.state.floatbank10_c_level 0.273
rougher.state.floatbank10_d_level 0.295
rougher.state.floatbank10_e_level 0.295
rougher.state.floatbank10_a_level 0.304
rougher.state.floatbank10_b_level 0.304
rougher.state.floatbank10_a_air 0.308
rougher.state.floatbank10_b_air 0.308
rougher.state.floatbank10_c_air 0.308
rougher.state.floatbank10_f_level 0.326
rougher.state.floatbank10_d_air 0.330
rougher.state.floatbank10_f_air 0.330
final.output.tail_au 0.357
final.output.tail_ag 0.365
final.output.concentrate_au 0.379
final.output.concentrate_pb 0.383
final.output.concentrate_ag 0.392
rougher.input.feed_ag 0.431
primary_cleaner.output.concentrate_ag 0.431
primary_cleaner.output.concentrate_au 0.431
rougher.output.concentrate_ag 0.431
rougher.output.concentrate_pb 0.431
rougher.output.concentrate_au 0.431
rougher.output.tail_pb 0.431
secondary_cleaner.output.tail_au 0.431
primary_cleaner.output.tail_au 0.436
rougher.input.feed_au 0.436
secondary_cleaner.output.tail_ag 0.440
secondary_cleaner.state.floatbank5_b_level 0.440
secondary_cleaner.state.floatbank5_a_air 0.445
secondary_cleaner.state.floatbank5_a_level 0.445
secondary_cleaner.state.floatbank5_b_air 0.445
secondary_cleaner.state.floatbank6_a_level 0.445
primary_cleaner.output.tail_ag 0.449
secondary_cleaner.state.floatbank4_b_air 0.475
secondary_cleaner.state.floatbank4_b_level 0.480
secondary_cleaner.output.tail_pb 0.511
secondary_cleaner.state.floatbank6_a_air 0.524
primary_cleaner.output.tail_pb 0.537
secondary_cleaner.state.floatbank3_b_air 0.546
secondary_cleaner.state.floatbank2_a_level 0.550
secondary_cleaner.state.floatbank3_b_level 0.555
secondary_cleaner.state.floatbank2_b_level 0.563
secondary_cleaner.state.floatbank3_a_level 0.568
secondary_cleaner.state.floatbank4_a_level 0.568
secondary_cleaner.state.floatbank3_a_air 0.577
secondary_cleaner.state.floatbank4_a_air 0.638
secondary_cleaner.state.floatbank2_b_air 0.784
rougher.output.concentrate_sol 0.836
final.output.tail_pb 0.880

1% - < 5% Missing Data (14 features)

Feature Percentage
rougher.input.feed_pb 1.074
final.output.tail_sol 1.193
primary_cleaner.output.tail_sol 1.545
rougher.input.feed_sol 1.580
secondary_cleaner.state.floatbank2_a_air 1.686
final.output.concentrate_sol 1.695
rougher.input.feed_size 1.933
primary_cleaner.output.concentrate_pb 1.972
rougher.input.floatbank10_xanthate 2.065
rougher.input.feed_rate 2.434
rougher.state.floatbank10_e_air 2.729
rougher.input.floatbank11_sulfate 2.985
primary_cleaner.output.concentrate_sol 3.513
primary_cleaner.input.xanthate 5.067

5% - < 10% Missing Data (7 features)

Feature Percentage
rougher.input.floatbank10_sulfate 5.727
primary_cleaner.input.depressant 6.806
primary_cleaner.input.sulfate 7.083
rougher.calculation.au_pb_ratio 7.162
final.output.recovery 8.641
secondary_cleaner.output.tail_sol 9.751
rougher.input.floatbank11_xanthate 9.936

10% - < 15% Missing Data (4 features)

Feature Percentage
rougher.output.tail_sol 12.044
rougher.output.tail_au 12.044
rougher.output.tail_ag 12.049
rougher.output.recovery 13.730

≥ 15% Missing Data (0 features)

No features in the Full dataset have ≥ 15% missing data.


Training Dataset Missing Data Distribution

< 1% Missing Data (62 features)

Feature Percentage
date 0.000
primary_cleaner.input.feed_size 0.000
rougher.calculation.sulfate_to_au_concentrate 0.160
rougher.calculation.floatbank10_sulfate_to_au_feed 0.160
rougher.calculation.floatbank11_sulfate_to_au_feed 0.160
primary_cleaner.state.floatbank8_b_level 0.160
primary_cleaner.state.floatbank8_c_level 0.160
primary_cleaner.state.floatbank8_d_level 0.160
primary_cleaner.state.floatbank8_a_level 0.196
primary_cleaner.state.floatbank8_c_air 0.225
primary_cleaner.state.floatbank8_d_air 0.231
primary_cleaner.state.floatbank8_a_air 0.237
primary_cleaner.state.floatbank8_b_air 0.237
rougher.state.floatbank10_c_level 0.273
rougher.state.floatbank10_d_level 0.302
rougher.state.floatbank10_e_level 0.302
rougher.state.floatbank10_a_level 0.314
rougher.state.floatbank10_b_level 0.314
rougher.state.floatbank10_a_air 0.314
rougher.state.floatbank10_b_air 0.314
rougher.state.floatbank10_c_air 0.314
rougher.state.floatbank10_d_air 0.344
rougher.state.floatbank10_f_level 0.344
rougher.state.floatbank10_f_air 0.344
final.output.tail_au 0.391
final.output.tail_ag 0.391
final.output.concentrate_au 0.421
final.output.concentrate_pb 0.427
final.output.concentrate_ag 0.427
primary_cleaner.output.concentrate_ag 0.486
primary_cleaner.output.concentrate_au 0.486
rougher.output.concentrate_ag 0.486
rougher.output.concentrate_pb 0.486
rougher.output.concentrate_au 0.486
rougher.output.tail_pb 0.486
secondary_cleaner.output.tail_au 0.486
rougher.input.feed_ag 0.486
primary_cleaner.output.tail_au 0.492
primary_cleaner.output.tail_ag 0.492
rougher.input.feed_au 0.492
secondary_cleaner.output.tail_ag 0.498
secondary_cleaner.state.floatbank5_b_level 0.498
secondary_cleaner.state.floatbank5_a_air 0.504
secondary_cleaner.state.floatbank5_a_level 0.504
secondary_cleaner.state.floatbank5_b_air 0.504
secondary_cleaner.state.floatbank6_a_level 0.504
secondary_cleaner.state.floatbank4_b_air 0.546
secondary_cleaner.state.floatbank4_b_level 0.552
secondary_cleaner.output.tail_pb 0.569
secondary_cleaner.state.floatbank3_a_air 0.575
primary_cleaner.output.tail_pb 0.587
secondary_cleaner.state.floatbank6_a_air 0.611
secondary_cleaner.state.floatbank3_b_air 0.641
secondary_cleaner.state.floatbank2_a_level 0.647
secondary_cleaner.state.floatbank3_b_level 0.652
secondary_cleaner.state.floatbank2_b_level 0.664
secondary_cleaner.state.floatbank3_a_level 0.670
secondary_cleaner.state.floatbank4_a_level 0.670
secondary_cleaner.state.floatbank4_a_air 0.765
final.output.tail_sol 0.860
secondary_cleaner.state.floatbank2_b_air 0.919
rougher.output.concentrate_sol 0.961

1% - < 5% Missing Data (14 features)

Feature Percentage
final.output.tail_pb 1.085
rougher.input.feed_pb 1.352
primary_cleaner.output.tail_sol 1.667
rougher.input.feed_sol 1.732
rougher.input.floatbank10_xanthate 2.052
secondary_cleaner.state.floatbank2_a_air 2.153
final.output.concentrate_sol 2.195
primary_cleaner.output.concentrate_pb 2.123
rougher.input.feed_size 2.473
rougher.input.feed_rate 3.043
rougher.state.floatbank10_e_air 3.577
rougher.input.floatbank11_sulfate 3.695
primary_cleaner.output.concentrate_sol 3.772
primary_cleaner.input.xanthate 5.842

5% - < 10% Missing Data (5 features)

Feature Percentage
rougher.input.floatbank10_sulfate 6.192
rougher.calculation.au_pb_ratio 7.367
primary_cleaner.input.depressant 7.485
primary_cleaner.input.sulfate 7.752
final.output.recovery 9.021

10% - < 15% Missing Data (5 features)

Feature Percentage
rougher.input.floatbank11_xanthate 11.293
secondary_cleaner.output.tail_sol 11.779
rougher.output.tail_sol 13.339
rougher.output.tail_au 13.339
rougher.output.tail_ag 13.345

≥ 15% Missing Data (1 feature)

Feature Percentage
rougher.output.recovery 15.261

Test Dataset Missing Data Distribution

< 1% Missing Data (45 features)

Feature Percentage
date 0.000
primary_cleaner.input.feed_size 0.000
primary_cleaner.state.floatbank8_a_air 0.274
primary_cleaner.state.floatbank8_a_level 0.274
primary_cleaner.state.floatbank8_b_air 0.274
primary_cleaner.state.floatbank8_b_level 0.274
primary_cleaner.state.floatbank8_c_air 0.274
primary_cleaner.state.floatbank8_c_level 0.274
primary_cleaner.state.floatbank8_d_air 0.274
primary_cleaner.state.floatbank8_d_level 0.274
rougher.input.feed_ag 0.274
rougher.input.feed_pb 0.274
rougher.input.feed_au 0.274
rougher.state.floatbank10_a_level 0.274
rougher.state.floatbank10_b_level 0.274
rougher.state.floatbank10_c_level 0.274
rougher.state.floatbank10_d_level 0.274
rougher.state.floatbank10_e_level 0.274
rougher.state.floatbank10_f_level 0.274
secondary_cleaner.state.floatbank2_a_level 0.274
secondary_cleaner.state.floatbank2_b_level 0.274
secondary_cleaner.state.floatbank3_a_level 0.274
secondary_cleaner.state.floatbank3_b_level 0.274
secondary_cleaner.state.floatbank4_a_air 0.274
secondary_cleaner.state.floatbank4_a_level 0.274
secondary_cleaner.state.floatbank4_b_air 0.274
secondary_cleaner.state.floatbank4_b_level 0.274
secondary_cleaner.state.floatbank5_a_air 0.274
secondary_cleaner.state.floatbank5_a_level 0.274
secondary_cleaner.state.floatbank5_b_air 0.274
secondary_cleaner.state.floatbank5_b_level 0.274
secondary_cleaner.state.floatbank6_a_air 0.274
secondary_cleaner.state.floatbank6_a_level 0.274
rougher.state.floatbank10_a_air 0.291
rougher.state.floatbank10_b_air 0.291
rougher.state.floatbank10_c_air 0.291
rougher.state.floatbank10_d_air 0.291
rougher.state.floatbank10_e_air 0.291
rougher.state.floatbank10_f_air 0.291
secondary_cleaner.state.floatbank2_a_air 0.343
rougher.input.feed_size 0.377
secondary_cleaner.state.floatbank2_b_air 0.394
secondary_cleaner.state.floatbank3_a_air 0.583
rougher.input.feed_rate 0.685
rougher.input.floatbank11_sulfate 0.942

1% - < 5% Missing Data (5 features)

Feature Percentage
rougher.input.feed_sol 1.148
rougher.input.floatbank10_xanthate 2.108
primary_cleaner.input.xanthate 2.844
rougher.input.floatbank10_sulfate 4.404
primary_cleaner.input.depressant 4.866

5% - < 10% Missing Data (2 features)

Feature Percentage
primary_cleaner.input.sulfate 5.175
rougher.input.floatbank11_xanthate 6.049

10% - < 15% Missing Data (0 features)

No features in the Test dataset have 10% - < 15% missing data.

≥ 15% Missing Data (0 features)

No features in the Test dataset have ≥ 15% missing data.


Dataset Comparison Summary

Missing Data Level Full Dataset Training Dataset Test Dataset
< 1% 62 features 62 features 45 features
1% - < 5% 14 features 14 features 5 features
5% - < 10% 7 features 5 features 2 features
10% - < 15% 4 features 5 features 0 features
≥ 15% 0 features 1 feature 0 features
Total with Missing Data 87 features 87 features 53 features
Missing Data Level Full Dataset Training Dataset Test Dataset
< 1% 71.26% 71.26% 84.91%
1% - < 5% 16.10% 16.10% 9.43%
5% - < 10% 8.05% 5.75% 3.77%
10% - < 15% 4.60% 5.57% 0%
≥ 15% 0% 1.15% 0%
Total with Missing Data 87 features 87 features 53 features

Key Insights:

  • Test dataset appears cleaner because it excludes 34 output/target/calculation features, which are the primary source of high missingness in the Full and Training sets.
  • Shared features show similar data quality across datasets, with the Test set sometimes performing slightly better on input features.
  • Training dataset reveals the full scope of missing data, including the most problematic target feature (rougher.output.recovery, 15.26% missing).
  • Output and calculation features consistently drive higher missing rates, while predictor (input) features remain relatively complete.
  • Excluded features explain most severe missing data issues, confirming that the Test set is intentionally designed for clean model evaluation.

Distribution of Missing Values

  • Negligible (< 1%): Full & Train ~71% of features; Test ~85% → most data is very clean.
  • Moderate (1–<10%): Full: ~24% (21); Train: ~22% (19); Test: ~13% (7) → mostly input features; manageable with simple imputation.
  • Severe (≥ 10%): Full: ~5% (4 features), Train: ~7% (6 features) → all outputs/calculations. Test: (N/A) → explains clean profile.

Modeling Impact

  • Predictor features: Low missingness (< 5%) → imputation straightforward and unlikely to distort results.
  • Target/output features: Higher missingness but not used for prediction → no direct risk to training or model reliability.

Fix Missing Values¶

  • The highest percent of missing values is for rougher.output.recovery in the Training Set (15.26%) and the Full Dataset (13.73%)
  • Target = final.output.recovery
In [24]:
# Get rid of the target Nan values `

# Training Dataset 

# View the NaN values for final.output.recovery (1,521 - 9.06% of data)
display(gold_train['final.output.recovery'].isna().sum())

# Drop the NaN values in final.output.recovery
train_target_df = pd.DataFrame(gold_train['final.output.recovery'].dropna())
display(train_target_df)

# Combine the original DF with train_target_df
gold_train_new = train_target_df.join(gold_train.drop(columns = ['final.output.recovery']), how = 'left')
display(gold_train_new)
1521
final.output.recovery
0 70.541216
1 69.266198
2 68.116445
3 68.347543
4 66.927016
... ...
16855 73.755150
16856 69.049291
16857 67.002189
16858 65.523246
16859 70.281454

15339 rows × 1 columns

final.output.recovery date final.output.concentrate_ag final.output.concentrate_pb final.output.concentrate_sol final.output.concentrate_au final.output.tail_ag final.output.tail_pb final.output.tail_sol final.output.tail_au ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 70.541216 2016-01-15 00:00:00 6.055403 9.889648 5.507324 42.192020 10.411962 0.895447 16.904297 2.143149 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 69.266198 2016-01-15 01:00:00 6.029369 9.968944 5.257781 42.701629 10.462676 0.927452 16.634514 2.224930 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 68.116445 2016-01-15 02:00:00 6.055926 10.213995 5.383759 42.657501 10.507046 0.953716 16.208849 2.257889 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 68.347543 2016-01-15 03:00:00 6.047977 9.977019 4.858634 42.689819 10.422762 0.883763 16.532835 2.146849 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 66.927016 2016-01-15 04:00:00 6.148599 10.142511 4.939416 42.774141 10.360302 0.792826 16.525686 2.055292 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16855 73.755150 2018-08-18 06:59:59 3.224920 11.356233 6.803482 46.713954 8.769645 3.141541 10.403181 1.529220 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
16856 69.049291 2018-08-18 07:59:59 3.195978 11.349355 6.862249 46.866780 8.897321 3.130493 10.549470 1.612542 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
16857 67.002189 2018-08-18 08:59:59 3.109998 11.434366 6.886013 46.795691 8.529606 2.911418 11.115147 1.596616 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
16858 65.523246 2018-08-18 09:59:59 3.367241 11.625587 6.799433 46.408188 8.777171 2.819214 10.463847 1.602879 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
16859 70.281454 2018-08-18 10:59:59 3.598375 11.737832 6.717509 46.299438 8.406690 2.517518 10.652193 1.389434 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

15339 rows × 87 columns

In [25]:
# Training Dataset

# Look at the missing values and re-determine percentage
new_missing_count = gold_train_new.isna().sum()

# Separates and calculates the pct values
new_missing_count_array = (new_missing_count.values / 15339) * 100

# Creates a DF w/ just the index(columns in this case) with the missing count
new_missing_count_df = pd.DataFrame(new_missing_count, columns = ['missing_cnt'])

# Creates a DF w/ just the index(columns in this case) with the missing percent
new_missing_count_2 = pd.DataFrame(new_missing_count_array, index = new_missing_count_df.index, columns = ['missing_pct'])

# Combines the 2 DF's to show count and percent missing
new_missing_count_df = pd.concat([new_missing_count_df,new_missing_count_2], axis = 1)
In [26]:
# Look at missing_pct that is greater than or equal to 1 
new_missing_count_df[new_missing_count_df['missing_pct'] >= 1].sort_values(by = 'missing_pct',ascending = False)
Out[26]:
missing_cnt missing_pct
secondary_cleaner.output.tail_sol 1778 11.591368
rougher.output.recovery 1190 7.758002
rougher.output.tail_ag 967 6.304192
rougher.output.tail_sol 966 6.297673
rougher.output.tail_au 966 6.297673
rougher.input.floatbank11_xanthate 779 5.078558
rougher.state.floatbank10_e_air 532 3.468283
primary_cleaner.output.concentrate_sol 408 2.659887
primary_cleaner.input.sulfate 381 2.483865
rougher.input.floatbank10_sulfate 375 2.444749
rougher.input.floatbank11_sulfate 357 2.327401
primary_cleaner.input.xanthate 276 1.799335
final.output.concentrate_sol 267 1.740661
primary_cleaner.input.depressant 257 1.675468
secondary_cleaner.state.floatbank2_a_air 230 1.499446
rougher.input.feed_rate 218 1.421214
primary_cleaner.output.concentrate_pb 161 1.049612
In [27]:
# Get rid of the target Nan values

# Full Dataset 

# View the NaN values for final.output.recovery (1,963 - 8.64% of data)
display(gold_full['final.output.recovery'].isna().sum())

# Drop the NaN values in final.output.recovery
full_target_df = pd.DataFrame(gold_full['final.output.recovery'].dropna())

# Combine the original DF with full_target_df
gold_full_new = full_target_df.join(gold_full.drop(columns = ['final.output.recovery']), how = 'left')
display(gold_full_new)
1963
final.output.recovery date final.output.concentrate_ag final.output.concentrate_pb final.output.concentrate_sol final.output.concentrate_au final.output.tail_ag final.output.tail_pb final.output.tail_sol final.output.tail_au ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 70.541216 2016-01-15 00:00:00 6.055403 9.889648 5.507324 42.192020 10.411962 0.895447 16.904297 2.143149 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 69.266198 2016-01-15 01:00:00 6.029369 9.968944 5.257781 42.701629 10.462676 0.927452 16.634514 2.224930 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 68.116445 2016-01-15 02:00:00 6.055926 10.213995 5.383759 42.657501 10.507046 0.953716 16.208849 2.257889 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 68.347543 2016-01-15 03:00:00 6.047977 9.977019 4.858634 42.689819 10.422762 0.883763 16.532835 2.146849 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 66.927016 2016-01-15 04:00:00 6.148599 10.142511 4.939416 42.774141 10.360302 0.792826 16.525686 2.055292 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22711 73.755150 2018-08-18 06:59:59 3.224920 11.356233 6.803482 46.713954 8.769645 3.141541 10.403181 1.529220 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
22712 69.049291 2018-08-18 07:59:59 3.195978 11.349355 6.862249 46.866780 8.897321 3.130493 10.549470 1.612542 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
22713 67.002189 2018-08-18 08:59:59 3.109998 11.434366 6.886013 46.795691 8.529606 2.911418 11.115147 1.596616 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
22714 65.523246 2018-08-18 09:59:59 3.367241 11.625587 6.799433 46.408188 8.777171 2.819214 10.463847 1.602879 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
22715 70.281454 2018-08-18 10:59:59 3.598375 11.737832 6.717509 46.299438 8.406690 2.517518 10.652193 1.389434 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

20753 rows × 87 columns

In [28]:
# Full Dataset

# Look at the missing values and re-determine percentage
new_missing_full_count = gold_full_new.isna().sum()

# Separates and calculates the pct values
new_missing_count_array_1 = (new_missing_full_count.values / 20753) * 100

# Creates a DF w/ just the index(columns in this case) with the missing count
new_missing_full_count_df = pd.DataFrame(new_missing_full_count, columns = ['missing_cnt'])

# Creates a DF w/ just the index(columns in this case) with the missing percent
new_missing_full_count_2 = pd.DataFrame(new_missing_count_array_1, index = new_missing_full_count_df.index, columns = ['missing_pct'])

# Combines the 2 DF's to show count and percent missing
new_missing_full_count_df = pd.concat([new_missing_full_count_df,new_missing_full_count_2], axis = 1)
display(new_missing_full_count_df)
missing_cnt missing_pct
final.output.recovery 0 0.000000
date 0 0.000000
final.output.concentrate_ag 1 0.004819
final.output.concentrate_pb 1 0.004819
final.output.concentrate_sol 267 1.286561
... ... ...
secondary_cleaner.state.floatbank5_a_level 1 0.004819
secondary_cleaner.state.floatbank5_b_air 1 0.004819
secondary_cleaner.state.floatbank5_b_level 1 0.004819
secondary_cleaner.state.floatbank6_a_air 3 0.014456
secondary_cleaner.state.floatbank6_a_level 1 0.004819

87 rows × 2 columns

In [29]:
# Look at missing_pct that is greater than or equal to 1 
new_missing_full_count_df[new_missing_full_count_df['missing_pct'] >= 1].sort_values(by = 'missing_pct',ascending = False)
Out[29]:
missing_cnt missing_pct
secondary_cleaner.output.tail_sol 1952 9.405869
rougher.output.recovery 1314 6.331615
rougher.output.tail_ag 1065 5.131788
rougher.output.tail_sol 1064 5.126970
rougher.output.tail_au 1064 5.126970
rougher.input.floatbank11_xanthate 812 3.912687
rougher.state.floatbank10_e_air 532 2.563485
primary_cleaner.output.concentrate_sol 517 2.491206
primary_cleaner.input.sulfate 388 1.869609
rougher.input.floatbank10_sulfate 380 1.831061
rougher.input.floatbank11_sulfate 368 1.773238
primary_cleaner.input.xanthate 282 1.358840
final.output.concentrate_sol 267 1.286561
primary_cleaner.input.depressant 263 1.267287
secondary_cleaner.state.floatbank2_a_air 233 1.122729
rougher.input.feed_rate 221 1.064906

MISSING DATA SUMMARY - TRAINING AND FULL DATASETS (TARGET HAS NO NAN VALUES)

Training Dataset Analysis

Missing final.output.recovery values: 1,521
Rows (no NaN values): 15,339
Data Removed: 9.06%

Columns ≥ 1% Missing Data

Feature Missing Percentage
secondary_cleaner.output.tail_sol ~11.6%
rougher.output.recovery ~7.8%
rougher.output.tail_ag ~6.3%
rougher.output.tail_sol ~6.3%
rougher.output.tail_au ~6.3%
rougher.input.floatbank11_xanthate ~5.1%
rougher.state.floatbank10_e_air ~3.5%
primary_cleaner.output.concentrate_sol ~2.7%
primary_cleaner.input.sulfate ~2.5%
rougher.input.floatbank10_sulfate ~2.4%
rougher.input.floatbank11_sulfate ~2.3%
primary_cleaner.input.xanthate ~1.8%
final.output.concentrate_sol ~1.7%
primary_cleaner.input.depressant ~1.7%
secondary_cleaner.state.floatbank2_a_air ~1.5%
rougher.input.feed_rate ~1.4%
primary_cleaner.output.concentrate_pb ~1.0%

Full Dataset Analysis

Missing final.output.recovery values: 1,963
Rows (no NaN values): 20,753
Data Removed: 8.64% >

Columns ≥ 1% Missing Data

Feature Missing Percentage
secondary_cleaner.output.tail_sol ~9.4%
rougher.output.recovery ~6.3%
rougher.output.tail_ag ~5.1%
rougher.output.tail_sol ~5.1%
rougher.output.tail_au ~5.1%
rougher.input.floatbank11_xanthate ~3.9%
rougher.state.floatbank10_e_air ~2.6%
primary_cleaner.output.concentrate_sol ~2.5%
primary_cleaner.input.sulfate ~1.9%
rougher.input.floatbank10_sulfate ~1.8%
rougher.input.floatbank11_sulfate ~1.8%
primary_cleaner.input.xanthate ~1.4%
final.output.concentrate_sol ~1.3%
primary_cleaner.input.depressant ~1.3%
secondary_cleaner.state.floatbank2_a_air ~1.1%
rougher.input.feed_rate ~1.1%

Key Observations

  • Training dataset has slightly higher missing data rates across most features compared to the full dataset
  • Output and target features (rougher.output., secondary_cleaner.output.) consistently show the highest missing data percentages
  • Input features generally have lower missing data rates than output features
  • Both datasets lose 8-9% of rows when removing NaN values for the final.output.recovery target variable

Drop Columns that May Cause Leakage¶

In [30]:
gold_full[['final.output.recovery']]
Out[30]:
final.output.recovery
0 70.541216
1 69.266198
2 68.116445
3 68.347543
4 66.927016
... ...
22711 73.755150
22712 69.049291
22713 67.002189
22714 65.523246
22715 70.281454

22716 rows × 1 columns

In [31]:
# Full Dataset

# Drop All calculation columns from the Full Dataset

calculation_full = gold_full_new.filter(like="calculation",axis = 1)
gold_full_new1 = gold_full_new.drop(calculation_full, axis=1)

# Drop all output columns from the Full Dataset except for final.output.recovery and rougher.output.recovery
output_full = gold_full_new1.filter(like = "output", axis = 1)

# Make a DF w/ only final.output.recovery and rougher.output.recovery
final_full = gold_full_new1[['final.output.recovery']]

# Drop rougher_final from the output_train DF
output_full = output_full.drop(final_full, axis=1)

# Drop output_train from gold_train_new1 (the new Training Dataset made)
gold_full_new1 = gold_full_new1.drop(output_full, axis = 1)
gold_full_new1

# Move final.output.recovery 
col = 'final.output.recovery'
cols = list(gold_full_new1.columns)
cols.insert(1,cols.pop(cols.index(col)))
gold_full_new1 = gold_full_new1[cols]

gold_full_new1
Out[31]:
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 70.541216 127.092003 10.128295 7.25 0.988759 1549.775757 -498.912140 1551.434204 -516.403442 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 69.266198 125.629232 10.296251 7.25 1.002663 1576.166671 -500.904965 1575.950626 -499.865889 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 68.116445 123.819808 11.316280 7.25 0.991265 1601.556163 -499.997791 1600.386685 -500.607762 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 68.347543 122.270188 11.322140 7.25 0.996739 1599.968720 -500.951778 1600.659236 -499.677094 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 66.927016 117.988169 11.913613 7.25 1.009869 1601.339707 -498.975456 1601.437854 -500.323246 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22711 2018-08-18 06:59:59 73.755150 123.381787 8.028927 6.50 1.304232 1648.421193 -400.382169 1648.742005 -400.359661 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
22712 2018-08-18 07:59:59 69.049291 120.878188 7.962636 6.50 1.302419 1649.820162 -399.930973 1649.357538 -399.721222 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
22713 2018-08-18 08:59:59 67.002189 105.666118 7.955111 6.50 1.315926 1649.166761 -399.888631 1649.196904 -399.677571 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
22714 2018-08-18 09:59:59 65.523246 98.880538 7.984164 6.50 1.241969 1646.547763 -398.977083 1648.212240 -400.383265 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
22715 2018-08-18 10:59:59 70.281454 95.248427 8.078957 6.50 1.283045 1648.759906 -399.862053 1650.135395 -399.957321 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

20753 rows × 54 columns

In [32]:
# Training Dataset

# Drop All calculation columns from the Training Dataset

calculation_train = gold_train_new.filter(like="calculation",axis = 1)
gold_train_new1 = gold_train_new.drop(calculation_train, axis=1)

# Drop all output columns from the Training Dataset except for final.output.recovery and rougher.output.recovery
output_train = gold_train_new1.filter(like = "output", axis = 1)

# Make a DF w/ only final.output.recovery and rougher.output.recovery
final_train = gold_train_new1[['final.output.recovery']]

# Drop rougher_final from the output_train DF
output_train = output_train.drop(final_train, axis=1)

# Drop output_train from gold_train_new1 (the new Training Dataset made)
gold_train_new1 = gold_train_new1.drop(output_train, axis = 1)
gold_train_new1

# Move final.output.recovery 
col = 'final.output.recovery'
cols = list(gold_train_new1.columns)
cols.insert(1,cols.pop(cols.index(col)))
gold_train_new1 = gold_train_new1[cols]

gold_train_new1
Out[32]:
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 70.541216 127.092003 10.128295 7.25 0.988759 1549.775757 -498.912140 1551.434204 -516.403442 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 69.266198 125.629232 10.296251 7.25 1.002663 1576.166671 -500.904965 1575.950626 -499.865889 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 68.116445 123.819808 11.316280 7.25 0.991265 1601.556163 -499.997791 1600.386685 -500.607762 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 68.347543 122.270188 11.322140 7.25 0.996739 1599.968720 -500.951778 1600.659236 -499.677094 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 66.927016 117.988169 11.913613 7.25 1.009869 1601.339707 -498.975456 1601.437854 -500.323246 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16855 2018-08-18 06:59:59 73.755150 123.381787 8.028927 6.50 1.304232 1648.421193 -400.382169 1648.742005 -400.359661 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
16856 2018-08-18 07:59:59 69.049291 120.878188 7.962636 6.50 1.302419 1649.820162 -399.930973 1649.357538 -399.721222 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
16857 2018-08-18 08:59:59 67.002189 105.666118 7.955111 6.50 1.315926 1649.166761 -399.888631 1649.196904 -399.677571 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
16858 2018-08-18 09:59:59 65.523246 98.880538 7.984164 6.50 1.241969 1646.547763 -398.977083 1648.212240 -400.383265 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
16859 2018-08-18 10:59:59 70.281454 95.248427 8.078957 6.50 1.283045 1648.759906 -399.862053 1650.135395 -399.957321 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

15339 rows × 54 columns

Missing Values¶

In [33]:
# Training Dataset

# Look at the missing values and re-determine percentage
new_missing_train_count1 = gold_train_new1.isna().sum()

# Separates and calculate the pct values
new_missing_count_array1 = (new_missing_train_count1.values / 15339) * 100

# Create a DF w/ just the index(columns in this case) with the missing count
new_missing_train_count_df1 = pd.DataFrame(new_missing_train_count1, columns = ['missing_cnt'])

# Creates a DF w/ just the index(columns in this case) with the missing percent
new_missing_train_count2 = pd.DataFrame(new_missing_count_array1, index = new_missing_train_count_df1.index, columns = ['missing_pct'])

# Combines the 2 DF's to show count and percent missing
new_missing_train_count_df1 = pd.concat([new_missing_train_count_df1,new_missing_train_count2], axis = 1)
display(new_missing_train_count_df1)
missing_cnt missing_pct
date 0 0.000000
final.output.recovery 0 0.000000
primary_cleaner.input.sulfate 381 2.483865
primary_cleaner.input.depressant 257 1.675468
primary_cleaner.input.feed_size 0 0.000000
primary_cleaner.input.xanthate 276 1.799335
primary_cleaner.state.floatbank8_a_air 8 0.052155
primary_cleaner.state.floatbank8_a_level 1 0.006519
primary_cleaner.state.floatbank8_b_air 8 0.052155
primary_cleaner.state.floatbank8_b_level 1 0.006519
primary_cleaner.state.floatbank8_c_air 6 0.039116
primary_cleaner.state.floatbank8_c_level 1 0.006519
primary_cleaner.state.floatbank8_d_air 7 0.045635
primary_cleaner.state.floatbank8_d_level 1 0.006519
rougher.input.feed_ag 0 0.000000
rougher.input.feed_pb 105 0.684530
rougher.input.feed_rate 218 1.421214
rougher.input.feed_size 146 0.951822
rougher.input.feed_sol 143 0.932264
rougher.input.feed_au 0 0.000000
rougher.input.floatbank10_sulfate 375 2.444749
rougher.input.floatbank10_xanthate 91 0.593259
rougher.input.floatbank11_sulfate 357 2.327401
rougher.input.floatbank11_xanthate 779 5.078558
rougher.state.floatbank10_a_air 16 0.104309
rougher.state.floatbank10_a_level 16 0.104309
rougher.state.floatbank10_b_air 16 0.104309
rougher.state.floatbank10_b_level 16 0.104309
rougher.state.floatbank10_c_air 16 0.104309
rougher.state.floatbank10_c_level 16 0.104309
rougher.state.floatbank10_d_air 15 0.097790
rougher.state.floatbank10_d_level 15 0.097790
rougher.state.floatbank10_e_air 532 3.468283
rougher.state.floatbank10_e_level 15 0.097790
rougher.state.floatbank10_f_air 15 0.097790
rougher.state.floatbank10_f_level 15 0.097790
secondary_cleaner.state.floatbank2_a_air 230 1.499446
secondary_cleaner.state.floatbank2_a_level 1 0.006519
secondary_cleaner.state.floatbank2_b_air 24 0.156464
secondary_cleaner.state.floatbank2_b_level 1 0.006519
secondary_cleaner.state.floatbank3_a_air 7 0.045635
secondary_cleaner.state.floatbank3_a_level 1 0.006519
secondary_cleaner.state.floatbank3_b_air 1 0.006519
secondary_cleaner.state.floatbank3_b_level 1 0.006519
secondary_cleaner.state.floatbank4_a_air 9 0.058674
secondary_cleaner.state.floatbank4_a_level 1 0.006519
secondary_cleaner.state.floatbank4_b_air 1 0.006519
secondary_cleaner.state.floatbank4_b_level 1 0.006519
secondary_cleaner.state.floatbank5_a_air 1 0.006519
secondary_cleaner.state.floatbank5_a_level 1 0.006519
secondary_cleaner.state.floatbank5_b_air 1 0.006519
secondary_cleaner.state.floatbank5_b_level 1 0.006519
secondary_cleaner.state.floatbank6_a_air 3 0.019558
secondary_cleaner.state.floatbank6_a_level 1 0.006519
In [34]:
# Find the NaN values that are missing more than 1%
new_missing_train_count_df1[new_missing_train_count_df1['missing_pct'] >= 1.0].sort_values(by='missing_pct', ascending = False)
Out[34]:
missing_cnt missing_pct
rougher.input.floatbank11_xanthate 779 5.078558
rougher.state.floatbank10_e_air 532 3.468283
primary_cleaner.input.sulfate 381 2.483865
rougher.input.floatbank10_sulfate 375 2.444749
rougher.input.floatbank11_sulfate 357 2.327401
primary_cleaner.input.xanthate 276 1.799335
primary_cleaner.input.depressant 257 1.675468
secondary_cleaner.state.floatbank2_a_air 230 1.499446
rougher.input.feed_rate 218 1.421214
In [35]:
# Full Dataset

# Look at the missing values and re-determine percentage
new_missing_full_count1 = gold_full_new1.isna().sum()

# Separates and calculates the pct values
new_missing_count_array2 = (new_missing_full_count1.values / 20753) * 100

# Creates a DF w/ just the index(columns in this case) with the missing count
new_missing_full_count_df1 = pd.DataFrame(new_missing_full_count1, columns = ['missing_cnt'])

# Creates a DF w/ just the index(columns in this case) with the missing percent
new_missing_full_count2 = pd.DataFrame(new_missing_count_array2, index = new_missing_full_count_df1.index, columns = ['missing_pct'])

# Combines the 2 DF's to show count and percent missing
new_missing_full_count_df1 = pd.concat([new_missing_full_count_df1,new_missing_full_count2], axis = 1)
display(new_missing_full_count_df1)
missing_cnt missing_pct
date 0 0.000000
final.output.recovery 0 0.000000
primary_cleaner.input.sulfate 388 1.869609
primary_cleaner.input.depressant 263 1.267287
primary_cleaner.input.feed_size 0 0.000000
primary_cleaner.input.xanthate 282 1.358840
primary_cleaner.state.floatbank8_a_air 8 0.038549
primary_cleaner.state.floatbank8_a_level 1 0.004819
primary_cleaner.state.floatbank8_b_air 8 0.038549
primary_cleaner.state.floatbank8_b_level 1 0.004819
primary_cleaner.state.floatbank8_c_air 6 0.028911
primary_cleaner.state.floatbank8_c_level 1 0.004819
primary_cleaner.state.floatbank8_d_air 7 0.033730
primary_cleaner.state.floatbank8_d_level 1 0.004819
rougher.input.feed_ag 0 0.000000
rougher.input.feed_pb 105 0.505951
rougher.input.feed_rate 221 1.064906
rougher.input.feed_size 147 0.708331
rougher.input.feed_sol 175 0.843252
rougher.input.feed_au 0 0.000000
rougher.input.floatbank10_sulfate 380 1.831061
rougher.input.floatbank10_xanthate 92 0.443309
rougher.input.floatbank11_sulfate 368 1.773238
rougher.input.floatbank11_xanthate 812 3.912687
rougher.state.floatbank10_a_air 16 0.077097
rougher.state.floatbank10_a_level 16 0.077097
rougher.state.floatbank10_b_air 16 0.077097
rougher.state.floatbank10_b_level 16 0.077097
rougher.state.floatbank10_c_air 16 0.077097
rougher.state.floatbank10_c_level 16 0.077097
rougher.state.floatbank10_d_air 15 0.072279
rougher.state.floatbank10_d_level 15 0.072279
rougher.state.floatbank10_e_air 532 2.563485
rougher.state.floatbank10_e_level 15 0.072279
rougher.state.floatbank10_f_air 15 0.072279
rougher.state.floatbank10_f_level 15 0.072279
secondary_cleaner.state.floatbank2_a_air 233 1.122729
secondary_cleaner.state.floatbank2_a_level 1 0.004819
secondary_cleaner.state.floatbank2_b_air 27 0.130102
secondary_cleaner.state.floatbank2_b_level 1 0.004819
secondary_cleaner.state.floatbank3_a_air 17 0.081916
secondary_cleaner.state.floatbank3_a_level 1 0.004819
secondary_cleaner.state.floatbank3_b_air 1 0.004819
secondary_cleaner.state.floatbank3_b_level 1 0.004819
secondary_cleaner.state.floatbank4_a_air 9 0.043367
secondary_cleaner.state.floatbank4_a_level 1 0.004819
secondary_cleaner.state.floatbank4_b_air 1 0.004819
secondary_cleaner.state.floatbank4_b_level 1 0.004819
secondary_cleaner.state.floatbank5_a_air 1 0.004819
secondary_cleaner.state.floatbank5_a_level 1 0.004819
secondary_cleaner.state.floatbank5_b_air 1 0.004819
secondary_cleaner.state.floatbank5_b_level 1 0.004819
secondary_cleaner.state.floatbank6_a_air 3 0.014456
secondary_cleaner.state.floatbank6_a_level 1 0.004819
In [36]:
# Find the NaN values that are missing less than 1%
new_missing_full_count_df1[new_missing_full_count_df1['missing_pct'] >= 1.0].sort_values(by='missing_pct', ascending = False)
Out[36]:
missing_cnt missing_pct
rougher.input.floatbank11_xanthate 812 3.912687
rougher.state.floatbank10_e_air 532 2.563485
primary_cleaner.input.sulfate 388 1.869609
rougher.input.floatbank10_sulfate 380 1.831061
rougher.input.floatbank11_sulfate 368 1.773238
primary_cleaner.input.xanthate 282 1.358840
primary_cleaner.input.depressant 263 1.267287
secondary_cleaner.state.floatbank2_a_air 233 1.122729
rougher.input.feed_rate 221 1.064906

Missing Data Analysis - Cleaned Datasets

Features with ≥ 1% Missing Data

After removing output/calculation columns and final.output.recovery NaN values

Training Dataset

Feature Missing Count Missing %
rougher.input.floatbank11_xanthate 779 5.08%
rougher.state.floatbank10_e_air 532 3.47%
primary_cleaner.input.sulfate 381 2.48%
rougher.input.floatbank10_sulfate 375 2.44%
rougher.input.floatbank11_sulfate 357 2.33%
primary_cleaner.input.xanthate 276 1.80%
primary_cleaner.input.depressant 257 1.68%
secondary_cleaner.state.floatbank2_a_air 230 1.50%
rougher.input.feed_rate 218 1.42%

Full Dataset

Feature Missing Count Missing %
rougher.input.floatbank11_xanthate 812 3.91%
rougher.state.floatbank10_e_air 532 2.56%
primary_cleaner.input.sulfate 388 1.87%
rougher.input.floatbank10_sulfate 380 1.83%
rougher.input.floatbank11_sulfate 368 1.77%
primary_cleaner.input.xanthate 282 1.36%
primary_cleaner.input.depressant 263 1.27%
secondary_cleaner.state.floatbank2_a_air 233 1.12%
rougher.input.feed_rate 221 1.06%

Key Findings

Training dataset shows consistently higher missing data rates even after removing problematic output/calculation features and final.output.recovery NaN values. The differences range from 0.17% to 1.17% higher in Training compared to Full dataset.

Input features dominate the remaining missing data, with rougher.input.floatbank11_xanthate being the most problematic (5.08% Training, 3.91% Full). All remaining features with ≥1% missing data are input or state features that would be used for prediction.

This represents the true scope of missing data challenges for modeling, as these are the features that will actually be needed for predictions and cannot be excluded from the analysis.

Missing Values - < 1% Missing Data¶

In [37]:
# Training Dataset

# Idendtify the rows where the NaN should be dropped
new_missing_train_count_df1[new_missing_train_count_df1['missing_pct'] < 1.0].sort_values(by = 'missing_pct', ascending = False)
Out[37]:
missing_cnt missing_pct
rougher.input.feed_size 146 0.951822
rougher.input.feed_sol 143 0.932264
rougher.input.feed_pb 105 0.684530
rougher.input.floatbank10_xanthate 91 0.593259
secondary_cleaner.state.floatbank2_b_air 24 0.156464
rougher.state.floatbank10_c_air 16 0.104309
rougher.state.floatbank10_b_level 16 0.104309
rougher.state.floatbank10_b_air 16 0.104309
rougher.state.floatbank10_a_level 16 0.104309
rougher.state.floatbank10_a_air 16 0.104309
rougher.state.floatbank10_c_level 16 0.104309
rougher.state.floatbank10_f_level 15 0.097790
rougher.state.floatbank10_f_air 15 0.097790
rougher.state.floatbank10_e_level 15 0.097790
rougher.state.floatbank10_d_level 15 0.097790
rougher.state.floatbank10_d_air 15 0.097790
secondary_cleaner.state.floatbank4_a_air 9 0.058674
primary_cleaner.state.floatbank8_a_air 8 0.052155
primary_cleaner.state.floatbank8_b_air 8 0.052155
secondary_cleaner.state.floatbank3_a_air 7 0.045635
primary_cleaner.state.floatbank8_d_air 7 0.045635
primary_cleaner.state.floatbank8_c_air 6 0.039116
secondary_cleaner.state.floatbank6_a_air 3 0.019558
secondary_cleaner.state.floatbank5_b_level 1 0.006519
secondary_cleaner.state.floatbank5_a_air 1 0.006519
secondary_cleaner.state.floatbank4_a_level 1 0.006519
secondary_cleaner.state.floatbank5_b_air 1 0.006519
secondary_cleaner.state.floatbank3_b_air 1 0.006519
secondary_cleaner.state.floatbank4_b_air 1 0.006519
secondary_cleaner.state.floatbank5_a_level 1 0.006519
secondary_cleaner.state.floatbank3_b_level 1 0.006519
secondary_cleaner.state.floatbank4_b_level 1 0.006519
secondary_cleaner.state.floatbank6_a_level 1 0.006519
secondary_cleaner.state.floatbank3_a_level 1 0.006519
secondary_cleaner.state.floatbank2_b_level 1 0.006519
secondary_cleaner.state.floatbank2_a_level 1 0.006519
primary_cleaner.state.floatbank8_d_level 1 0.006519
primary_cleaner.state.floatbank8_c_level 1 0.006519
primary_cleaner.state.floatbank8_b_level 1 0.006519
primary_cleaner.state.floatbank8_a_level 1 0.006519
final.output.recovery 0 0.000000
rougher.input.feed_au 0 0.000000
rougher.input.feed_ag 0 0.000000
primary_cleaner.input.feed_size 0 0.000000
date 0 0.000000
In [38]:
# Training Dataset

# Get rid of the rows that have less than 1% of NaN values

index_train = new_missing_train_count_df1[new_missing_train_count_df1['missing_pct'] < 1.0].sort_values(
    by = 'missing_pct', ascending = False).index

# Turn the columns w/ less than 1% to a list
index_train = list(index_train)

# Drop the NaN values with less than 1% of NaN values
gold_train_new2 = gold_train_new1.dropna(subset = index_train)

# Look at the NaN values in the DF
gold_train_new2.isna().sum().sort_values(ascending = False)
Out[38]:
rougher.input.floatbank11_xanthate            654
rougher.state.floatbank10_e_air               508
rougher.input.floatbank11_sulfate             262
primary_cleaner.input.sulfate                 259
rougher.input.floatbank10_sulfate             259
secondary_cleaner.state.floatbank2_a_air      220
primary_cleaner.input.xanthate                190
primary_cleaner.input.depressant              189
rougher.input.feed_rate                       183
rougher.state.floatbank10_f_air                 0
rougher.state.floatbank10_f_level               0
secondary_cleaner.state.floatbank3_a_level      0
secondary_cleaner.state.floatbank2_a_level      0
rougher.state.floatbank10_e_level               0
secondary_cleaner.state.floatbank2_b_air        0
secondary_cleaner.state.floatbank2_b_level      0
secondary_cleaner.state.floatbank3_a_air        0
date                                            0
rougher.state.floatbank10_d_level               0
secondary_cleaner.state.floatbank3_b_level      0
secondary_cleaner.state.floatbank4_a_air        0
secondary_cleaner.state.floatbank4_a_level      0
secondary_cleaner.state.floatbank4_b_air        0
secondary_cleaner.state.floatbank4_b_level      0
secondary_cleaner.state.floatbank5_a_air        0
secondary_cleaner.state.floatbank5_a_level      0
secondary_cleaner.state.floatbank5_b_air        0
secondary_cleaner.state.floatbank5_b_level      0
secondary_cleaner.state.floatbank6_a_air        0
secondary_cleaner.state.floatbank3_b_air        0
rougher.state.floatbank10_b_level               0
rougher.state.floatbank10_d_air                 0
rougher.input.feed_ag                           0
primary_cleaner.input.feed_size                 0
primary_cleaner.state.floatbank8_a_air          0
primary_cleaner.state.floatbank8_a_level        0
primary_cleaner.state.floatbank8_b_air          0
primary_cleaner.state.floatbank8_b_level        0
primary_cleaner.state.floatbank8_c_air          0
primary_cleaner.state.floatbank8_c_level        0
primary_cleaner.state.floatbank8_d_air          0
primary_cleaner.state.floatbank8_d_level        0
rougher.input.feed_pb                           0
rougher.state.floatbank10_c_level               0
rougher.input.feed_size                         0
rougher.input.feed_sol                          0
rougher.input.feed_au                           0
rougher.input.floatbank10_xanthate              0
rougher.state.floatbank10_a_air                 0
rougher.state.floatbank10_a_level               0
rougher.state.floatbank10_b_air                 0
final.output.recovery                           0
rougher.state.floatbank10_c_air                 0
secondary_cleaner.state.floatbank6_a_level      0
dtype: int64
In [39]:
# All values are still above 1%; thus imputation is necessary for the rest of the NaN values

display(654 / 14855 * 100)
display(508 / 14855 * 100)
display(262 / 14855 * 100)
display(259 / 14855 * 100)
display(259 / 14855 * 100)
display(220 / 14855 * 100)
display(190 / 14855 * 100)
display(189 / 14855 * 100)
display(183 / 14855 * 100)
4.402558061258835
3.4197239986536516
1.7637159205654662
1.743520700100976
1.743520700100976
1.4809828340626052
1.2790306294177045
1.2722988892628744
1.2319084483338942
In [40]:
# Full Dataset

# Idendtify the rows where the NaN should be dropped

new_missing_full_count_df1[new_missing_full_count_df1['missing_pct'] < 1.0].sort_values(by = 'missing_pct', ascending = False)
Out[40]:
missing_cnt missing_pct
rougher.input.feed_sol 175 0.843252
rougher.input.feed_size 147 0.708331
rougher.input.feed_pb 105 0.505951
rougher.input.floatbank10_xanthate 92 0.443309
secondary_cleaner.state.floatbank2_b_air 27 0.130102
secondary_cleaner.state.floatbank3_a_air 17 0.081916
rougher.state.floatbank10_b_air 16 0.077097
rougher.state.floatbank10_a_air 16 0.077097
rougher.state.floatbank10_c_air 16 0.077097
rougher.state.floatbank10_b_level 16 0.077097
rougher.state.floatbank10_a_level 16 0.077097
rougher.state.floatbank10_c_level 16 0.077097
rougher.state.floatbank10_d_level 15 0.072279
rougher.state.floatbank10_f_level 15 0.072279
rougher.state.floatbank10_f_air 15 0.072279
rougher.state.floatbank10_e_level 15 0.072279
rougher.state.floatbank10_d_air 15 0.072279
secondary_cleaner.state.floatbank4_a_air 9 0.043367
primary_cleaner.state.floatbank8_b_air 8 0.038549
primary_cleaner.state.floatbank8_a_air 8 0.038549
primary_cleaner.state.floatbank8_d_air 7 0.033730
primary_cleaner.state.floatbank8_c_air 6 0.028911
secondary_cleaner.state.floatbank6_a_air 3 0.014456
secondary_cleaner.state.floatbank3_b_level 1 0.004819
secondary_cleaner.state.floatbank4_a_level 1 0.004819
secondary_cleaner.state.floatbank5_b_level 1 0.004819
secondary_cleaner.state.floatbank5_a_level 1 0.004819
secondary_cleaner.state.floatbank5_a_air 1 0.004819
secondary_cleaner.state.floatbank4_b_level 1 0.004819
secondary_cleaner.state.floatbank3_b_air 1 0.004819
secondary_cleaner.state.floatbank4_b_air 1 0.004819
secondary_cleaner.state.floatbank5_b_air 1 0.004819
secondary_cleaner.state.floatbank6_a_level 1 0.004819
secondary_cleaner.state.floatbank3_a_level 1 0.004819
secondary_cleaner.state.floatbank2_b_level 1 0.004819
secondary_cleaner.state.floatbank2_a_level 1 0.004819
primary_cleaner.state.floatbank8_d_level 1 0.004819
primary_cleaner.state.floatbank8_c_level 1 0.004819
primary_cleaner.state.floatbank8_b_level 1 0.004819
primary_cleaner.state.floatbank8_a_level 1 0.004819
final.output.recovery 0 0.000000
rougher.input.feed_au 0 0.000000
rougher.input.feed_ag 0 0.000000
primary_cleaner.input.feed_size 0 0.000000
date 0 0.000000
In [41]:
# Full Dataset

# Get rid of the rows that have less than 1% of NaN values

index_full = new_missing_full_count_df1[new_missing_full_count_df1['missing_pct'] < 1.0].sort_values(
    by = 'missing_pct', ascending = False).index

# Turn the columns w/ less than 1% to a list
index_full = list(index_full)

# Drop the NaN values with less than 1% of NaN values
gold_full_new2 = gold_full_new1.dropna(subset = index_full)

# Look at the NaN values in the DF
gold_full_new2.isna().sum().sort_values(ascending = False)
Out[41]:
rougher.input.floatbank11_xanthate            684
rougher.state.floatbank10_e_air               508
rougher.input.floatbank11_sulfate             272
primary_cleaner.input.sulfate                 263
rougher.input.floatbank10_sulfate             261
secondary_cleaner.state.floatbank2_a_air      223
primary_cleaner.input.xanthate                193
primary_cleaner.input.depressant              193
rougher.input.feed_rate                       185
rougher.state.floatbank10_f_air                 0
rougher.state.floatbank10_f_level               0
secondary_cleaner.state.floatbank3_a_level      0
secondary_cleaner.state.floatbank2_a_level      0
rougher.state.floatbank10_e_level               0
secondary_cleaner.state.floatbank2_b_air        0
secondary_cleaner.state.floatbank2_b_level      0
secondary_cleaner.state.floatbank3_a_air        0
date                                            0
rougher.state.floatbank10_d_level               0
secondary_cleaner.state.floatbank3_b_level      0
secondary_cleaner.state.floatbank4_a_air        0
secondary_cleaner.state.floatbank4_a_level      0
secondary_cleaner.state.floatbank4_b_air        0
secondary_cleaner.state.floatbank4_b_level      0
secondary_cleaner.state.floatbank5_a_air        0
secondary_cleaner.state.floatbank5_a_level      0
secondary_cleaner.state.floatbank5_b_air        0
secondary_cleaner.state.floatbank5_b_level      0
secondary_cleaner.state.floatbank6_a_air        0
secondary_cleaner.state.floatbank3_b_air        0
rougher.state.floatbank10_b_level               0
rougher.state.floatbank10_d_air                 0
rougher.input.feed_ag                           0
primary_cleaner.input.feed_size                 0
primary_cleaner.state.floatbank8_a_air          0
primary_cleaner.state.floatbank8_a_level        0
primary_cleaner.state.floatbank8_b_air          0
primary_cleaner.state.floatbank8_b_level        0
primary_cleaner.state.floatbank8_c_air          0
primary_cleaner.state.floatbank8_c_level        0
primary_cleaner.state.floatbank8_d_air          0
primary_cleaner.state.floatbank8_d_level        0
rougher.input.feed_pb                           0
rougher.state.floatbank10_c_level               0
rougher.input.feed_size                         0
rougher.input.feed_sol                          0
rougher.input.feed_au                           0
rougher.input.floatbank10_xanthate              0
rougher.state.floatbank10_a_air                 0
rougher.state.floatbank10_a_level               0
rougher.state.floatbank10_b_air                 0
final.output.recovery                           0
rougher.state.floatbank10_c_air                 0
secondary_cleaner.state.floatbank6_a_level      0
dtype: int64
In [42]:
# Although some values are below 1%, to keep aligned with training set, these will also be imputated

display(684 / 20226 * 100)
display(508 / 20226 * 100)
display(272 / 20226 * 100)
display(263 / 20226 * 100)
display(261 / 20226 * 100)
display(223 / 20226 * 100)
display(193 / 20226 * 100)
display(193 / 20226 * 100)
display(185 / 20226 * 100)
3.3817858202313853
2.5116187085929003
1.3448037179867498
1.3003065361416
1.2904182735093443
1.1025412834964896
0.954217344012657
0.954217344012657
0.9146642934836349

Missing Data Analysis - Updated Cleaned Datasets

Training Dataset

After removing output/calculation columns, final.output.recovery NaN values, AND features with <1% missing data

Dataset Size: 14,855 rows

Feature Missing Count Missing %
rougher.input.floatbank11_xanthate 654 4.40%
rougher.state.floatbank10_e_air 508 3.42%
rougher.input.floatbank11_sulfate 262 1.76%
primary_cleaner.input.sulfate 259 1.74%
rougher.input.floatbank10_sulfate 259 1.74%
secondary_cleaner.state.floatbank2_a_air 220 1.48%
primary_cleaner.input.xanthate 190 1.28%
primary_cleaner.input.depressant 189 1.27%
rougher.input.feed_rate 183 1.23%

Full Dataset

After removing output/calculation columns, final.output.recovery NaN values, AND features with <1% missing data

Dataset Size: 20,226 rows

Feature Missing Count Missing %
rougher.input.floatbank11_xanthate 684 3.38%
rougher.state.floatbank10_e_air 508 2.51%
rougher.input.floatbank11_sulfate 272 1.34%
primary_cleaner.input.sulfate 263 1.30%
rougher.input.floatbank10_sulfate 261 1.29%
secondary_cleaner.state.floatbank2_a_air 223 1.10%
primary_cleaner.input.xanthate 193 0.95%
primary_cleaner.input.depressant 193 0.95%
rougher.input.feed_rate 185 0.91%

Conclusion Summary

Final scope: exactly 9 features need missing data attention. After aggressive cleaning, these are the only features with meaningful missing data gaps. Training dataset consistently shows 0.32-1.02% higher missing rates than Full dataset.

Rougher.input.floatbank11_xanthate remains the main challenge at 4.4% missing - everything else is under 2%. Standard imputation will handle this easily. Clean problem, clean solution.

Missing Values - Imputation¶

In [43]:
# Training Dataset

# Investigate the 9 missing columns

imp_cols = ['rougher.input.floatbank11_xanthate','rougher.state.floatbank10_e_air','rougher.input.floatbank11_sulfate',
'primary_cleaner.input.sulfate','rougher.input.floatbank10_sulfate','secondary_cleaner.state.floatbank2_a_air',
'primary_cleaner.input.xanthate','primary_cleaner.input.depressant','rougher.input.feed_rate']	

# Create a histogram for the 9 features with missing values
gold_train_new2[imp_cols].hist(figsize = [15,10])
plt.show()

display(gold_train_new2[imp_cols].mean())
print()
display(gold_train_new2[imp_cols].median())
print()
display(gold_train_new2[imp_cols].describe())
No description has been provided for this image
rougher.input.floatbank11_xanthate             5.888175
rougher.state.floatbank10_e_air             1078.990959
rougher.input.floatbank11_sulfate             11.389359
primary_cleaner.input.sulfate                134.643263
rougher.input.floatbank10_sulfate             11.763041
secondary_cleaner.state.floatbank2_a_air      29.341060
primary_cleaner.input.xanthate                 0.879605
primary_cleaner.input.depressant               8.929893
rougher.input.feed_rate                      467.518312
dtype: float64

rougher.input.floatbank11_xanthate             5.999159
rougher.state.floatbank10_e_air             1049.988150
rougher.input.floatbank11_sulfate             11.414551
primary_cleaner.input.sulfate                134.315614
rougher.input.floatbank10_sulfate             11.708082
secondary_cleaner.state.floatbank2_a_air      30.017078
primary_cleaner.input.xanthate                 0.888980
primary_cleaner.input.depressant               8.043323
rougher.input.feed_rate                      498.403275
dtype: float64

rougher.input.floatbank11_xanthate rougher.state.floatbank10_e_air rougher.input.floatbank11_sulfate primary_cleaner.input.sulfate rougher.input.floatbank10_sulfate secondary_cleaner.state.floatbank2_a_air primary_cleaner.input.xanthate primary_cleaner.input.depressant rougher.input.feed_rate
count 14201.000000 14347.000000 14593.000000 14596.000000 14596.000000 14635.000000 14665.000000 14666.000000 14672.000000
mean 5.888175 1078.990959 11.389359 134.643263 11.763041 29.341060 0.879605 8.929893 467.518312
std 1.149914 202.268601 3.734254 39.951815 3.313112 6.334380 0.386629 3.437355 111.056786
min 0.000290 -1.970275 0.000049 0.003112 0.000044 0.077503 0.000006 0.000000 0.001166
25% 5.198196 951.610899 9.506037 107.939017 9.855929 25.094668 0.601735 6.051069 408.014630
50% 5.999159 1049.988150 11.414551 134.315614 11.708082 30.017078 0.888980 8.043323 498.403275
75% 6.700692 1199.872051 13.500338 161.428026 13.684285 34.878417 1.101912 11.912753 545.801298
max 9.698407 1922.636637 37.980648 251.999948 36.118275 60.000000 2.512968 20.673152 717.508837
In [44]:
# View columns that look similar to the missing features (rougher.input.floatbank##_xanthate) 

# Look for columns with rougher.input.floatbank and then drop the columns that do not have rougher.input.floatbank##_xanthate
xanthate = gold_train_new2.filter(like="rougher.input.floatbank", axis = 1).drop(
    gold_train_new2[['rougher.input.floatbank11_sulfate','rougher.input.floatbank10_sulfate']], axis = 1)

# Compare rougher.input.floatbank##_xanthate columns (subtract)
xanthate_difference = xanthate['rougher.input.floatbank10_xanthate'] - xanthate['rougher.input.floatbank11_xanthate']

# Create a DF to compare
xanthate['xanthate_difference'] = xanthate_difference



# Group the differences and compare statistics

print("Less Than -1")
display(xanthate[xanthate['xanthate_difference'] < -1].median())
display(xanthate[xanthate['xanthate_difference'] < -1].mean())
display(xanthate[xanthate['xanthate_difference'] < -1].max())
display(xanthate[xanthate['xanthate_difference'] < -1].min())
print(len(xanthate[xanthate['xanthate_difference'] <- 1]))

print()
print("Greater Than 1")
display(xanthate[xanthate['xanthate_difference'] > 1].median())
display(xanthate[xanthate['xanthate_difference'] > 1].mean())
display(xanthate[xanthate['xanthate_difference'] > 1].max())
display(xanthate[xanthate['xanthate_difference'] > 1].min())
display(len(xanthate[xanthate['xanthate_difference'] > 1]))

print()
print("Between 0 - 1")
display(xanthate[(xanthate['xanthate_difference'] > 0) & (xanthate['xanthate_difference'] <= 1)].median())
display(xanthate[(xanthate['xanthate_difference'] > 0) & (xanthate['xanthate_difference'] <= 1)].mean())
display(xanthate[(xanthate['xanthate_difference'] > 0) & (xanthate['xanthate_difference'] <= 1)].max())
display(xanthate[(xanthate['xanthate_difference'] > 0) & (xanthate['xanthate_difference'] <= 1)].min())
display(len(xanthate[(xanthate['xanthate_difference'] > 0) & (xanthate['xanthate_difference'] <= 1)]))

print()
print("Between 0 - (-)1")
display(xanthate[(xanthate['xanthate_difference'] < 0) & (xanthate['xanthate_difference'] >= -1)].median())
display(xanthate[(xanthate['xanthate_difference'] < 0) & (xanthate['xanthate_difference'] >= -1)].mean())
display(xanthate[(xanthate['xanthate_difference'] < 0) & (xanthate['xanthate_difference'] >= -1)].max())
display(xanthate[(xanthate['xanthate_difference'] < 0) & (xanthate['xanthate_difference'] >= -1)].min())
display(len(xanthate[(xanthate['xanthate_difference'] < 0) & (xanthate['xanthate_difference'] >= -1)]))
print()
print("All")
display(xanthate[xanthate['xanthate_difference'] > -20].median())
display(xanthate[xanthate['xanthate_difference'] > -20].mean())
display(xanthate[xanthate['xanthate_difference'] > -20].max())
display(xanthate[xanthate['xanthate_difference'] > -20].min())
Less Than -1
rougher.input.floatbank10_xanthate    6.243166
rougher.input.floatbank11_xanthate    7.493640
xanthate_difference                  -1.416374
dtype: float64
rougher.input.floatbank10_xanthate    4.437777
rougher.input.floatbank11_xanthate    6.770322
xanthate_difference                  -2.332545
dtype: float64
rougher.input.floatbank10_xanthate    7.624764
rougher.input.floatbank11_xanthate    8.833521
xanthate_difference                  -1.001298
dtype: float64
rougher.input.floatbank10_xanthate    0.000508
rougher.input.floatbank11_xanthate    1.482072
xanthate_difference                  -8.004946
dtype: float64
174

Greater Than 1
rougher.input.floatbank10_xanthate    5.781152
rougher.input.floatbank11_xanthate    1.931478
xanthate_difference                   3.203859
dtype: float64
rougher.input.floatbank10_xanthate    5.702966
rougher.input.floatbank11_xanthate    2.413444
xanthate_difference                   3.289522
dtype: float64
rougher.input.floatbank10_xanthate    8.036454
rougher.input.floatbank11_xanthate    6.318315
xanthate_difference                   7.576177
dtype: float64
rougher.input.floatbank10_xanthate    3.108537
rougher.input.floatbank11_xanthate    0.000290
xanthate_difference                   1.026330
dtype: float64
93
Between 0 - 1
rougher.input.floatbank10_xanthate    6.000332
rougher.input.floatbank11_xanthate    5.998025
xanthate_difference                   0.002270
dtype: float64
rougher.input.floatbank10_xanthate    5.908348
rougher.input.floatbank11_xanthate    5.898371
xanthate_difference                   0.009977
dtype: float64
rougher.input.floatbank10_xanthate    9.703448
rougher.input.floatbank11_xanthate    9.698407
xanthate_difference                   0.953365
dtype: float64
rougher.input.floatbank10_xanthate    4.394071e-03
rougher.input.floatbank11_xanthate    2.024813e-03
xanthate_difference                   2.598773e-07
dtype: float64
6850
Between 0 - (-)1
rougher.input.floatbank10_xanthate    5.995132
rougher.input.floatbank11_xanthate    6.000465
xanthate_difference                  -0.002462
dtype: float64
rougher.input.floatbank10_xanthate    5.858244
rougher.input.floatbank11_xanthate    5.902265
xanthate_difference                  -0.044021
dtype: float64
rougher.input.floatbank10_xanthate    9.655247
rougher.input.floatbank11_xanthate    9.667279
xanthate_difference                  -0.000001
dtype: float64
rougher.input.floatbank10_xanthate    0.000886
rougher.input.floatbank11_xanthate    0.001334
xanthate_difference                  -0.999544
dtype: float64
7084
All
rougher.input.floatbank10_xanthate    5.998175
rougher.input.floatbank11_xanthate    5.999159
xanthate_difference                  -0.000075
dtype: float64
rougher.input.floatbank10_xanthate    5.863991
rougher.input.floatbank11_xanthate    5.888175
xanthate_difference                  -0.024184
dtype: float64
rougher.input.floatbank10_xanthate    9.703448
rougher.input.floatbank11_xanthate    9.698407
xanthate_difference                   7.576177
dtype: float64
rougher.input.floatbank10_xanthate    0.000508
rougher.input.floatbank11_xanthate    0.000290
xanthate_difference                  -8.004946
dtype: float64
In [45]:
# Full Dataset

# Create a histogram for the 9 features with missing values
gold_full_new2[imp_cols].hist(figsize = [15,10])
plt.show()

display(gold_full_new2[imp_cols].mean())
print()
display(gold_full_new2[imp_cols].median())
print()
display(gold_full_new2[imp_cols].describe())
No description has been provided for this image
rougher.input.floatbank11_xanthate             6.058794
rougher.state.floatbank10_e_air             1072.584818
rougher.input.floatbank11_sulfate             12.069147
primary_cleaner.input.sulfate                145.346889
rougher.input.floatbank10_sulfate             12.314856
secondary_cleaner.state.floatbank2_a_air      28.535827
primary_cleaner.input.xanthate                 1.013894
primary_cleaner.input.depressant               8.863749
rougher.input.feed_rate                      473.234067
dtype: float64

rougher.input.floatbank11_xanthate             6.099767
rougher.state.floatbank10_e_air             1049.694499
rougher.input.floatbank11_sulfate             12.000267
primary_cleaner.input.sulfate                144.133440
rougher.input.floatbank10_sulfate             12.001148
secondary_cleaner.state.floatbank2_a_air      29.086048
primary_cleaner.input.xanthate                 0.942111
primary_cleaner.input.depressant               8.044753
rougher.input.feed_rate                      498.480665
dtype: float64

rougher.input.floatbank11_xanthate rougher.state.floatbank10_e_air rougher.input.floatbank11_sulfate primary_cleaner.input.sulfate rougher.input.floatbank10_sulfate secondary_cleaner.state.floatbank2_a_air primary_cleaner.input.xanthate primary_cleaner.input.depressant rougher.input.feed_rate
count 19542.000000 19718.000000 19954.000000 19963.000000 19965.000000 20003.000000 20033.000000 20033.000000 20041.000000
mean 6.058794 1072.584818 12.069147 145.346889 12.314856 28.535827 1.013894 8.863749 473.234067
std 1.116986 185.794118 3.780202 44.634608 3.469103 5.850875 0.519693 3.345438 110.765504
min 0.000290 -1.970275 0.000049 0.003112 0.000044 0.025693 0.000006 0.000000 0.001166
25% 5.405188 997.822139 9.998294 114.747722 10.000173 25.047215 0.694630 6.095366 407.138551
50% 6.099767 1049.694499 12.000267 144.133440 12.001148 29.086048 0.942111 8.044753 498.480665
75% 6.801447 1199.174819 14.576829 175.825511 14.679636 32.987186 1.207878 11.030483 549.840741
max 9.698407 1922.636637 37.980648 265.983123 36.118275 60.000000 4.102454 20.673152 717.508837
In [46]:
# View columns that look similar to the missing features (rougher.input.floatbank##_xanthate) - Full Dataset

# Look for columns with rougher.input.floatbank and then drop the columns that do not have rougher.input.floatbank##_xanthate

xanthate_full = gold_full_new2.filter(like="rougher.input.floatbank", axis = 1).drop(
    gold_full_new2[['rougher.input.floatbank11_sulfate','rougher.input.floatbank10_sulfate']], axis = 1)

# Compare rougher.input.floatbank##_xanthate columns (subtract)
xanthate_difference_full = xanthate_full['rougher.input.floatbank10_xanthate'] - xanthate_full['rougher.input.floatbank11_xanthate']

# Create a DF to compare
xanthate_full['xanthate_difference'] = xanthate_difference_full

# Check the DF's median difference to check if it aligns with the Training Dastaset
display(xanthate_full.median())
display(xanthate_full[xanthate_full['xanthate_difference'] > -20].median())
rougher.input.floatbank10_xanthate    6.004194
rougher.input.floatbank11_xanthate    6.099767
xanthate_difference                  -0.000087
dtype: float64
rougher.input.floatbank10_xanthate    6.098051
rougher.input.floatbank11_xanthate    6.099767
xanthate_difference                  -0.000087
dtype: float64
In [47]:
# Training Dataset

# Impute the 'rougher.input.floatbank11_xanthate' NaN values to match the coordinating 'rougher.input.floatbank10_xanthate' values

display(gold_train_new2[gold_train_new2['rougher.input.floatbank11_xanthate'].isna()])

# Fill the NaN values for floatbank 11 as the corresponding floatbank 10 values
gold_train_new2 = gold_train_new2.copy()

mask = gold_train_new2['rougher.input.floatbank11_xanthate']
gold_train_new2['rougher.input.floatbank11_xanthate'] = (mask.fillna(gold_train_new2['rougher.input.floatbank10_xanthate']))

# Ensure no more NaN values for rougher.input.floatbank11_xanthate
display(gold_train_new2[gold_train_new2['rougher.input.floatbank11_xanthate'].isna()])

# Test it, ensure correct values placed correctly
test = gold_train_new2[['rougher.input.floatbank11_xanthate','rougher.input.floatbank10_xanthate']]
test[2495:]
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
2918 2016-05-15 13:59:59 79.975380 159.551841 11.972081 7.740000 0.904101 1203.830417 -499.751674 1204.874868 -499.591962 ... 9.989485 -501.018930 8.028083 -504.302157 8.806407 -499.302177 5.980273 -500.507677 20.006564 -496.164292
2919 2016-05-15 14:59:59 45.586191 58.717323 12.027662 6.610000 0.922359 1392.770893 -500.326034 1299.437987 -499.958419 ... 10.012410 -501.591840 7.907263 -513.814631 8.568602 -498.174475 5.968032 -499.963947 19.985684 -503.306448
3446 2016-06-06 13:59:59 46.375447 0.832935 NaN 6.800000 0.010702 1191.181589 -502.140749 1192.766017 -494.615276 ... 10.016232 -591.105501 8.034319 -547.173732 7.957247 -495.834105 6.025972 -500.221367 18.015904 -495.628009
4455 2016-07-18 14:59:59 52.576792 62.830953 7.006670 8.210000 0.574743 1601.946035 -500.703455 1607.720913 -500.257229 ... 15.012221 -401.688850 4.974202 -400.312056 10.051200 -403.969189 5.039952 -400.190704 23.012079 -501.597792
5660 2017-01-06 19:59:59 67.275195 NaN 0.416502 7.790000 NaN 1854.162080 -615.091268 2114.906758 -593.303764 ... 0.000000 -530.197868 0.000000 -799.619081 -0.112612 -792.794538 0.646208 -515.392892 0.238511 -809.398668
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16502 2018-08-03 13:59:59 100.000000 NaN NaN 7.280000 NaN 1650.905789 -494.820004 1648.101136 -570.593484 ... 22.024408 -495.583514 16.985811 -499.811230 17.012212 -504.164124 12.998162 -500.848297 17.994886 -618.618605
16598 2018-08-07 13:59:59 100.000000 0.089524 NaN 7.004999 0.004911 998.890138 -763.169284 1069.561022 -799.562139 ... 19.778919 -691.281395 16.481060 -507.062871 16.710791 -752.526118 13.087867 -799.649084 16.549049 -776.686885
16599 2018-08-07 14:59:59 100.000000 0.017954 NaN 7.320000 0.012019 1602.086817 -583.850387 1870.308253 -667.950493 ... 26.990625 -506.292971 23.023367 -499.713257 22.950561 -587.299584 17.938124 -640.422102 22.951411 -625.940102
16600 2018-08-07 15:59:59 100.000000 0.042164 0.020230 7.320000 0.003251 1579.144432 -483.492915 1845.758204 -505.822619 ... 26.981932 -240.370737 22.948309 -161.049087 22.945266 -301.530549 17.774928 -321.616150 22.941181 -339.193550
16608 2018-08-07 23:59:59 100.000000 NaN 0.133502 7.320000 0.005070 1567.512693 -400.292034 1873.332954 -399.589371 ... 26.975033 -502.882224 23.075104 -499.854192 22.984960 -539.538899 17.997212 -499.904563 22.987740 -625.612323

654 rows × 54 columns

date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level

0 rows × 54 columns

Out[47]:
rougher.input.floatbank11_xanthate rougher.input.floatbank10_xanthate
2916 4.403525 4.403645
2917 3.911547 4.400047
2918 5.418245 5.418245
2919 4.744994 4.744994
2920 4.476183 4.755170
... ... ...
16855 9.156069 9.158609
16856 9.297924 9.304952
16857 9.300133 9.299606
16858 9.297194 9.297709
16859 9.309852 9.308612

12360 rows × 2 columns

In [48]:
# Full Dataset

# Impute the 'rougher.input.floatbank11_xanthate' NaN values to match the coordinating 'rougher.input.floatbank10_xanthate' values

display(gold_full_new2[gold_full_new2['rougher.input.floatbank11_xanthate'].isna()])

# Fill the NaN values for floatbank 11 as the corresponding floatbank 10 values
gold_full_new2 = gold_full_new2.copy()

mask1 = gold_full_new2['rougher.input.floatbank11_xanthate']
gold_full_new2['rougher.input.floatbank11_xanthate'] = (mask1.fillna(gold_full_new2['rougher.input.floatbank10_xanthate']))

# Ensure no more NaN values for rougher.input.floatbank11_xanthate
display(gold_full_new2[gold_full_new2['rougher.input.floatbank11_xanthate'].isna()])

# Test it, ensure correct values placed correctly
test1 = gold_full_new2[['rougher.input.floatbank11_xanthate','rougher.input.floatbank10_xanthate']]
test1[2495:]
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
2918 2016-05-15 13:59:59 79.975380 159.551841 11.972081 7.740000 0.904101 1203.830417 -499.751674 1204.874868 -499.591962 ... 9.989485 -501.018930 8.028083 -504.302157 8.806407 -499.302177 5.980273 -500.507677 20.006564 -496.164292
2919 2016-05-15 14:59:59 45.586191 58.717323 12.027662 6.610000 0.922359 1392.770893 -500.326034 1299.437987 -499.958419 ... 10.012410 -501.591840 7.907263 -513.814631 8.568602 -498.174475 5.968032 -499.963947 19.985684 -503.306448
3446 2016-06-06 13:59:59 46.375447 0.832935 NaN 6.800000 0.010702 1191.181589 -502.140749 1192.766017 -494.615276 ... 10.016232 -591.105501 8.034319 -547.173732 7.957247 -495.834105 6.025972 -500.221367 18.015904 -495.628009
4455 2016-07-18 14:59:59 52.576792 62.830953 7.006670 8.210000 0.574743 1601.946035 -500.703455 1607.720913 -500.257229 ... 15.012221 -401.688850 4.974202 -400.312056 10.051200 -403.969189 5.039952 -400.190704 23.012079 -501.597792
6044 2016-09-22 19:59:59 41.146342 31.604122 0.392449 7.680000 2.127074 1601.265098 -499.012130 1600.804614 -497.257896 ... 11.985557 -566.867567 9.932428 -506.285964 9.938096 -508.042363 5.003788 -501.182628 20.020713 -505.117458
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22358 2018-08-03 13:59:59 100.000000 NaN NaN 7.280000 NaN 1650.905789 -494.820004 1648.101136 -570.593484 ... 22.024408 -495.583514 16.985811 -499.811230 17.012212 -504.164124 12.998162 -500.848297 17.994886 -618.618605
22454 2018-08-07 13:59:59 100.000000 0.089524 NaN 7.004999 0.004911 998.890138 -763.169284 1069.561022 -799.562139 ... 19.778919 -691.281395 16.481060 -507.062871 16.710791 -752.526118 13.087867 -799.649084 16.549049 -776.686885
22455 2018-08-07 14:59:59 100.000000 0.017954 NaN 7.320000 0.012019 1602.086817 -583.850387 1870.308253 -667.950493 ... 26.990625 -506.292971 23.023367 -499.713257 22.950561 -587.299584 17.938124 -640.422102 22.951411 -625.940102
22456 2018-08-07 15:59:59 100.000000 0.042164 0.020230 7.320000 0.003251 1579.144432 -483.492915 1845.758204 -505.822619 ... 26.981932 -240.370737 22.948309 -161.049087 22.945266 -301.530549 17.774928 -321.616150 22.941181 -339.193550
22464 2018-08-07 23:59:59 100.000000 NaN 0.133502 7.320000 0.005070 1567.512693 -400.292034 1873.332954 -399.589371 ... 26.975033 -502.882224 23.075104 -499.854192 22.984960 -539.538899 17.997212 -499.904563 22.987740 -625.612323

684 rows × 54 columns

date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level

0 rows × 54 columns

Out[48]:
rougher.input.floatbank11_xanthate rougher.input.floatbank10_xanthate
2916 4.403525 4.403645
2917 3.911547 4.400047
2918 5.418245 5.418245
2919 4.744994 4.744994
2920 4.476183 4.755170
... ... ...
22711 9.156069 9.158609
22712 9.297924 9.304952
22713 9.300133 9.299606
22714 9.297194 9.297709
22715 9.309852 9.308612

17731 rows × 2 columns

In [49]:
# Test Dataset

# Impute the 'rougher.input.floatbank11_xanthate' NaN values to match the coordinating 'rougher.input.floatbank10_xanthate' values

gold_test_new2 = gold_test

display(gold_test_new2[gold_test_new2['rougher.input.floatbank11_xanthate'].isna()])

# Fill the NaN values for floatbank 11 as the corresponding floatbank 10 values
mask_test_set = gold_test_new2['rougher.input.floatbank11_xanthate']
gold_test_new2['rougher.input.floatbank11_xanthate'] = (mask_test_set.fillna(gold_test_new2['rougher.input.floatbank10_xanthate']))


# Check how many rows were filled
with pd.option_context('display.max_columns',None):
    display(gold_test_new2[gold_test_new2['rougher.input.floatbank11_xanthate'].isna()])
date primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level primary_cleaner.state.floatbank8_c_air ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
245 2016-09-11 05:59:59 15.630879 0.990310 7.83 0.193422 1.299719e+03 -498.963997 1.303213e+03 -499.904044 1.301021e+03 ... 8.021303 -506.306992 4.895868 -502.255707 7.904954 -508.638295 3.043868 -501.832155 19.989689 -603.877450
246 2016-09-11 06:59:59 NaN NaN 7.83 NaN 1.299738e+03 -612.606492 1.302267e+03 -505.831097 1.299494e+03 ... 8.073119 -501.810437 4.956902 -499.412118 8.153360 -492.626261 3.081493 -476.885347 20.035238 -603.205354
247 2016-09-11 07:59:59 NaN NaN 7.83 NaN 1.300137e+03 -752.693115 1.302233e+03 -575.075180 1.299606e+03 ... 8.022703 -520.809679 4.954530 -569.697771 7.900118 -513.604732 2.960581 -499.693953 19.998444 -734.132064
248 2016-09-11 08:59:59 NaN NaN 7.83 NaN 1.297507e+03 -794.938526 1.303108e+03 -798.079260 1.300093e+03 ... 7.907092 -509.244830 4.972887 -580.824925 8.079996 -519.416398 2.948391 -499.492902 19.993156 -749.453971
249 2016-09-11 09:59:59 NaN NaN 7.83 NaN 1.296324e+03 -795.541101 1.302574e+03 -798.501004 1.299455e+03 ... 8.030514 -517.587841 5.054695 -668.750178 8.218528 -552.623825 2.967558 -499.669186 20.015216 -798.761749
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5728 2017-12-26 16:59:59 68.748269 4.967088 5.80 0.546002 1.601695e+03 -401.309128 1.603799e+03 -398.454978 1.306149e+03 ... 20.007729 -500.282320 15.116499 -500.088520 11.053275 -500.223931 8.996529 -499.892002 12.003847 -500.072328
5729 2017-12-26 17:59:59 66.671316 4.982771 5.80 0.546054 1.601798e+03 -399.342516 1.601004e+03 -399.589165 1.309597e+03 ... 19.970385 -499.739051 15.034725 -499.893765 11.005241 -498.989282 9.046014 -499.756903 11.983336 -498.917245
5747 2017-12-27 11:59:59 7.469704 NaN 7.20 0.004984 5.445860e-32 -782.195107 6.647490e-32 -496.247779 4.033736e-32 ... 3.265143 -799.427634 1.080343 -799.879409 0.200600 -797.323986 1.420606 -800.118587 0.016815 -789.840007
5748 2017-12-27 12:59:59 5.630580 NaN 7.20 NaN 9.261879e+02 -686.329732 9.278570e+02 -548.715931 8.940153e+02 ... 5.882153 -798.499014 3.409673 -799.002255 2.406223 -797.219304 2.938882 -800.082169 2.454536 -803.594266
5749 2017-12-27 13:59:59 2.620827 NaN 7.20 0.004011 1.394345e+03 -771.082673 1.394144e+03 -777.498223 1.398281e+03 ... 19.962331 -797.550678 14.999044 -798.207155 10.961253 -797.003879 8.987172 -800.059438 12.016949 -804.138159

353 rows × 53 columns

date primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level primary_cleaner.state.floatbank8_c_air primary_cleaner.state.floatbank8_c_level primary_cleaner.state.floatbank8_d_air primary_cleaner.state.floatbank8_d_level rougher.input.feed_ag rougher.input.feed_pb rougher.input.feed_rate rougher.input.feed_size rougher.input.feed_sol rougher.input.feed_au rougher.input.floatbank10_sulfate rougher.input.floatbank10_xanthate rougher.input.floatbank11_sulfate rougher.input.floatbank11_xanthate rougher.state.floatbank10_a_air rougher.state.floatbank10_a_level rougher.state.floatbank10_b_air rougher.state.floatbank10_b_level rougher.state.floatbank10_c_air rougher.state.floatbank10_c_level rougher.state.floatbank10_d_air rougher.state.floatbank10_d_level rougher.state.floatbank10_e_air rougher.state.floatbank10_e_level rougher.state.floatbank10_f_air rougher.state.floatbank10_f_level secondary_cleaner.state.floatbank2_a_air secondary_cleaner.state.floatbank2_a_level secondary_cleaner.state.floatbank2_b_air secondary_cleaner.state.floatbank2_b_level secondary_cleaner.state.floatbank3_a_air secondary_cleaner.state.floatbank3_a_level secondary_cleaner.state.floatbank3_b_air secondary_cleaner.state.floatbank3_b_level secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
564 2016-09-24 12:59:59 NaN NaN 7.68 0.011503 2.799954e-18 -796.779434 4.163275e-21 -799.983618 3.855577e-19 -799.948108 2.014029e-21 -799.784812 0.000000 0.000000 0.193756 48.015584 0.0 0.000000 NaN NaN 0.096795 NaN 0.001266 -650.096515 -0.605473 -650.105302 -0.068244 -645.931153 -0.924575 -628.468917 -1.944313 -650.122621 -2.381344 -649.913483 0.023057 -799.530262 1.092967e-16 -792.875062 1.136677e-18 -799.561997 1.807957e-20 -734.556570 1.834961e-22 -745.821862 6.161089e-18 -799.802846 -0.214178 -797.367106 0.704683 -800.020932 0.161888 -799.913285
565 2016-09-24 13:59:59 NaN NaN 7.68 0.000091 2.799954e-18 -797.124505 4.163275e-21 -799.977005 3.855577e-19 -799.950312 2.014029e-21 -799.771693 0.000000 0.000000 0.204528 48.018385 0.0 0.000000 NaN NaN 0.105139 NaN 0.002585 -650.069770 -0.650662 -650.073165 -0.071626 -644.887999 -0.966635 -629.221695 -1.921022 -650.134710 -2.404892 -649.888636 0.008992 -799.545795 1.092967e-16 -794.451180 1.136677e-18 -799.586654 1.807957e-20 -735.627312 1.834961e-22 -746.906048 6.161089e-18 -799.797453 -0.215287 -798.454745 0.711425 -800.021343 0.157048 -799.922919
568 2016-09-24 16:59:59 NaN NaN 7.68 0.003025 1.392885e-01 -798.855473 5.527349e-02 -799.989777 1.014484e+00 -799.981199 9.770101e-01 -799.827218 0.000000 0.000000 0.212210 48.026788 0.0 0.000000 NaN NaN 0.021897 NaN -0.001964 -650.056702 -0.685651 -650.091352 -0.062618 -645.245733 -1.018060 -636.253339 -1.881713 -650.142921 -2.436634 -649.857932 0.010803 -799.495730 0.000000e+00 -794.478846 0.000000e+00 -799.555332 0.000000e+00 -738.153816 0.000000e+00 -748.585027 0.000000e+00 -799.831149 -0.186699 -797.669733 0.722592 -799.997136 0.179766 -799.936573
585 2016-09-25 09:59:59 NaN NaN 7.68 0.010103 2.312186e-20 -796.302283 5.786564e-21 -799.984843 1.205502e-20 -799.962190 3.571137e-21 -799.793338 0.000000 0.000000 0.166748 48.033980 0.0 0.000000 NaN NaN 0.030064 NaN -0.087418 -650.165051 -0.631677 -650.180366 -0.230563 -642.077940 -0.831368 -623.281513 -1.963070 -650.037184 -2.356340 -649.943086 3.935541 -799.491407 0.000000e+00 -791.800662 0.000000e+00 -799.609465 0.000000e+00 -732.983100 0.000000e+00 -750.653885 0.000000e+00 -747.383683 -0.181728 -796.941962 0.690744 -800.028909 0.202145 -799.903983
586 2016-09-25 10:59:59 NaN NaN 7.68 0.006077 2.312186e-20 -796.684510 5.786564e-21 -799.981251 1.205502e-20 -799.964183 3.571137e-21 -799.799478 0.000000 0.000000 0.178374 48.034314 0.0 0.000000 NaN NaN 0.109840 NaN -0.068932 -650.134236 -0.669726 -650.147578 -0.240868 -647.688869 -0.846947 -628.388343 -1.951958 -650.056851 -2.370610 -649.918108 0.475950 -799.497027 0.000000e+00 -792.206187 0.000000e+00 -799.613829 0.000000e+00 -734.963840 0.000000e+00 -750.157445 0.000000e+00 -747.406024 -0.197555 -797.115802 0.691613 -800.026245 0.187614 -799.911700
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3078 2017-09-07 06:59:59 NaN NaN 6.44 NaN 0.000000e+00 -797.408631 0.000000e+00 -799.709651 0.000000e+00 -799.952526 0.000000e+00 -799.761447 0.000000 0.000000 1.747975 51.061615 0.0 0.000000 NaN NaN 0.006781 NaN -0.030619 -651.731882 -0.298349 -650.280348 -0.026257 -634.300938 -0.355675 -627.073357 -1.843096 -641.492751 -2.154654 -649.953740 5.457716 -799.541372 0.000000e+00 -778.422522 0.000000e+00 -799.610773 0.000000e+00 -666.111484 0.000000e+00 -799.734266 0.000000e+00 -799.861995 0.501476 -798.069070 0.657183 -800.035928 0.291482 -809.551876
3079 2017-09-07 07:59:59 NaN NaN 6.44 NaN 0.000000e+00 -797.596816 0.000000e+00 -799.189274 0.000000e+00 -799.953647 0.000000e+00 -799.765142 0.000000 0.000000 1.744616 51.064115 0.0 0.000000 NaN NaN 0.007444 NaN -0.019793 -651.208723 -0.328038 -650.261088 -0.022318 -639.661897 -0.402197 -623.377953 -1.847239 -642.092749 -2.201130 -649.943623 5.445182 -799.542015 0.000000e+00 -778.430384 0.000000e+00 -799.606907 0.000000e+00 -665.891486 0.000000e+00 -799.730764 0.000000e+00 -799.864521 0.507308 -798.682121 0.650173 -800.038423 0.295164 -809.087251
4549 2017-11-07 13:59:59 0.006246 NaN 7.15 0.001874 1.601321e+03 -506.241798 1.598187e+03 -503.990762 1.496591e+03 -496.429140 1.598291e+03 -502.147076 0.295709 0.148351 5.544637 64.104260 NaN 0.302262 NaN NaN 0.041485 NaN 1195.228167 -300.289581 1396.458446 -499.850158 1345.411732 -500.639176 1243.202842 -501.887101 1148.085706 -500.526577 1048.310270 -498.953279 0.838440 -773.427423 3.603394e-03 -795.667200 2.904193e+01 -500.249506 2.300451e+01 -500.010143 2.101980e+01 -502.283743 1.901142e+01 -500.743581 15.057985 -499.898364 7.928971 -499.832515 16.020571 -499.911516
5485 2017-12-16 13:59:59 NaN NaN 7.49 0.012088 1.198172e+00 -783.388562 2.241474e+00 -799.970691 1.687280e+00 -799.644774 1.665990e+00 -579.911423 0.000000 0.000000 NaN 25.351381 0.0 0.000000 NaN NaN 0.010526 NaN 999.321327 -413.098504 1200.388407 -650.012364 1198.360990 -500.612020 999.776180 -617.944162 999.757989 -500.139406 900.204461 -648.189008 25.074803 -753.629855 1.999943e+01 -506.324089 2.498809e+01 -798.731928 1.804578e+01 -809.149110 7.349942e-17 -799.649342 8.306354e-17 -799.873377 10.949957 -608.576945 7.972650 -369.027573 12.000541 -804.865410
5486 2017-12-16 14:59:59 NaN NaN 7.49 0.008673 5.547224e-01 -784.049738 1.035048e+00 -799.985963 7.791375e-01 -799.864020 7.693061e-01 -581.016175 0.000000 0.000000 NaN 25.352417 0.0 0.000000 NaN NaN 0.029582 NaN 1001.081732 -412.607977 1200.722607 -650.016866 1201.885233 -503.141458 1001.078637 -622.168851 1001.112081 -499.743099 901.047630 -647.264918 25.094838 -789.752004 1.991080e+01 -508.635696 2.494250e+01 -798.892687 1.801429e+01 -809.148189 7.349942e-17 -799.649104 8.306354e-17 -799.877852 10.909520 -639.780856 7.952206 -354.341867 11.982311 -804.065385

116 rows × 53 columns

In [ ]:
 
In [50]:
# Compare the new histograms

gold_train_new2['rougher.input.floatbank11_xanthate'].hist(figsize = [8,5])
plt.title("Training Set")
plt.show()

gold_full_new2['rougher.input.floatbank11_xanthate'].hist(figsize = [8,5])
plt.title("Full Set")
plt.show()
No description has been provided for this image
No description has been provided for this image

Xanthate Difference Analysis: rougher.input.floatbank##_xanthate

Statistical Summary by Difference Groups

Distribution: The data appears roughly normally distributed with a slight right skew, centered around 6-7

Training | Full

image.png

After Imputation

image.png

xanthate_difference = rougher.input.floatbank10_xanthate - rougher.input.floatbank11_xanthate

Less Than -1 (174 observations)

Statistic floatbank10_xanthate floatbank11_xanthate xanthate_difference
Median 6.243 7.494 -1.416
Mean 4.438 6.770 -2.333
Range 7.6248, 0.001 8.834, 1.482 -1.001, -8.005

Greater Than 1 (93 observations)

Statistic floatbank10_xanthate floatbank11_xanthate xanthate_difference
Median 5.781 1.931 3.204
Mean 5.703 2.413 3.290
Range 8.036, 3.109 6.318, 0.000 7.576, 1.026

Between 0 - 1 (6,850 observations)

Statistic floatbank10_xanthate floatbank11_xanthate xanthate_difference
Median 6.000 5.998 0.002
Mean 5.908 5.898 0.010
Range 9.703, 0.004 9.698, 0.002 0.953, 0.000

Between 0 - (-1) (7,084 observations)

Statistic floatbank10_xanthate floatbank11_xanthate xanthate_difference
Median 5.995 6.000 -0.002
Mean 5.858 5.902 -0.044
Range 9.655, 0.001 9.667, 0.001 -0.000, -1.000

ALL (14,201 observations)

Statistic floatbank10_xanthate floatbank11_xanthate xanthate_difference
Median 5.998 5.999 -0.000
Mean 5.864 5.888 -0.024
Range 0.001 - 9.703 0.000 - 9.698 -8.005 - 7.586

Key Observations

  • Most observations fall within small differences: 13,934 out of 14,201 total observations (98.1%) have differences between -1 and +1
  • Extreme negative differences are more common: 174 observations with differences < -1 vs 93 observations with differences > +1
  • Near-equilibrium groups dominate: The "Between 0-1" and "Between 0-(-1)" groups contain the vast majority of data points
  • Largest extreme difference: -8.004946 in the "Less Than -1" group
  • Training Dataset Difference Median: -0.000087
  • Full Dataset Difference Median: -0.000075
  • Training and Full Median Difference: 0.000012

Conclusion

Given that the median difference between rougher.input.floatbank10_xanthate and rougher.input.floatbank11_xanthate is approximately 0 (-0.000075), and most observations (98.1%) fall within small differences, imputing missing floatbank11 values using the corresponding floatbank10 values (i.e., floatbank11 = floatbank10) appears reasonable. However, this assumes missing values follow the same near-equilibrium pattern as the majority of the data. This should give the most accurate representation for our model.

The median difference for the Training Set (-0.000075) and Full Set (-0.000087) are virtually the same at 0 (0.000012), validating our imputation approach. Moreover, when observing the datsets after imputation, you see that the Full Dataset changed less than the Training Dataset; this further demonstrates the accuracy of our imputation strategy.

In [51]:
# Show the histograms again, continue imputation

gold_train_new2[imp_cols].hist(figsize = [12,8])
plt.show()

test = gold_train_new2[['rougher.state.floatbank10_f_air','rougher.state.floatbank10_e_air']]

test[test.columns].hist(figsize = [10,4])
plt.show()
No description has been provided for this image
No description has been provided for this image
In [52]:
def r_state_fb_nan(df1):
    # Focus on the columns you are comparing
    df = df1.filter(like = 'rougher.state.floatbank', axis=1).copy()
    df_cols = df.filter(like = 'level').columns
    df_cols = list(df_cols)
    df = df.drop(df[df_cols], axis = 1)
    df = df.drop(['rougher.state.floatbank10_a_air','rougher.state.floatbank10_b_air','rougher.state.floatbank10_c_air','rougher.state.floatbank10_d_air'], axis = 1)
    
    # Get the difference of the columns and put it in a DF
    df_difference_f = df['rougher.state.floatbank10_f_air'] - df['rougher.state.floatbank10_e_air']
    df['difference_f'] = df_difference_f

    # Show the metrics of the entire DF
    print("Difference Metrics (fb10_e_air & fb10_f_air)")
    print("-------------------------------------------")
    print("Length")
    display(len(df['difference_f']))
    print()
    print("Median")
    display(df['difference_f'].median())
    print()
    print("Mean")
    display(df['difference_f'].mean())
    print()
    print("Std")
    display(df['difference_f'].std())
    print()
    print("Min")
    display(df['difference_f'].min())
    print()
    print("Max")
    display(df['difference_f'].max())
    print()
    print()

    # Metrics for Categories
    print("Metrics Difference: Categorical")
    print("-------------------------------------------")
    print()
    print("Difference F > 5")
    print("-------------------------------------------")
    print("Length")
    display(len(df[df['difference_f'] > 5]))
    print()
    print("Median")
    display(df[df['difference_f'] > 5].median())
    print()
    print("Mean")
    display(df[df['difference_f'] > 5].mean())
    print()
    print("Min")
    display(df[df['difference_f'] > 5].min())
    print()
    print("Max")
    display(df[df['difference_f'] > 5].max())
    print()
    print("Difference F Between 1 - 5")
    print("-------------------------------------------")
    print("Length")
    display(len(df[(df['difference_f'] > 1) & (df['difference_f'] <= 5)]))
    print()
    print("Median")
    display(df[(df['difference_f'] > 1) & (df['difference_f'] <= 5)].median())
    print()
    print("Mean")
    display(df[(df['difference_f'] > 1) & (df['difference_f'] <= 5)].mean())
    print()
    print("Min")
    display(df[(df['difference_f'] > 1) & (df['difference_f'] <= 5)].min())
    print()
    print("Max")
    display(df[(df['difference_f'] > 1) & (df['difference_f'] <= 5)].max())
    print()
    print("Difference F Between 0 - 1")
    print("-------------------------------------------")
    print("Length")
    display(len(df[(df['difference_f'] > 0) & (df['difference_f'] <= 1)]))
    print()
    print("Median")
    display(df[(df['difference_f'] > 0) & (df['difference_f'] <= 1)].median())
    print()
    print("Mean")
    display(df[(df['difference_f'] > 0) & (df['difference_f'] <= 1)].mean())
    print()
    print("Min")
    display(df[(df['difference_f'] > 0) & (df['difference_f'] <= 1)].min())
    print()
    print("Max")
    display(df[(df['difference_f'] > 0) & (df['difference_f'] <= 1)].max())
    print()
    print("Difference F Between (-)1 - 0")
    print("-------------------------------------------")
    print("Length")
    display(len(df[(df['difference_f'] >= -1) & (df['difference_f'] < 0)]))
    print()
    print("Median")
    display(df[(df['difference_f'] >= -1) & (df['difference_f'] < 0)].median())
    print()
    print("Mean")
    display(df[(df['difference_f'] >= -1) & (df['difference_f'] < 0)].mean())
    print()
    print("Min")
    display(df[(df['difference_f'] >= -1) & (df['difference_f'] < 0)].min())
    print()
    print("Max")
    display(df[(df['difference_f'] >= -1) & (df['difference_f'] < 0)].max())
    print()
    print("Difference F Between (-)5 - (-)1")
    print("-------------------------------------------")
    print("Length")
    display(len(df[(df['difference_f'] >= -5) & (df['difference_f'] < -1)]))
    print()
    print("Median")
    display(df[(df['difference_f'] >= -5) & (df['difference_f'] < -1)].median())
    print()
    print("Mean")
    display(df[(df['difference_f'] >= -5) & (df['difference_f'] < -1)].mean())
    print()
    print("Min")
    display(df[(df['difference_f'] >= -5) & (df['difference_f'] < -1)].min())
    print()
    print("Max")
    display(df[(df['difference_f'] >= -5) & (df['difference_f'] < -1)].max())
    print()
    print("Difference F Between (-)50 - (-)5")
    print("-------------------------------------------")
    print("Length")
    display(len(df[(df['difference_f'] >= -50) & (df['difference_f'] < -5)]))
    print()
    print("Median")
    display(df[(df['difference_f'] >= -50) & (df['difference_f'] < -5)].median())
    print()
    print("Mean")
    display(df[(df['difference_f'] >= -50) & (df['difference_f'] < -5)].mean())
    print()
    print("Min")
    display(df[(df['difference_f'] >= -50) & (df['difference_f'] < -5)].min())
    print()
    print("Max")
    display(df[(df['difference_f'] >= -50) & (df['difference_f'] < -5)].max())
    print()
    print("Difference F Between (-)100 - (-)50")
    print("-------------------------------------------")
    print("Length")
    display(len(df[(df['difference_f'] >= -100) & (df['difference_f'] < -50)]))
    print()
    print("Median")
    display(df[(df['difference_f'] >= -100) & (df['difference_f'] < -50)].median())
    print()
    print("Mean")
    display(df[(df['difference_f'] >= -100) & (df['difference_f'] < -50)].mean())
    print()
    print("Min")
    display(df[(df['difference_f'] >= -100) & (df['difference_f'] < -50)].min())
    print()
    print("Max")
    display(df[(df['difference_f'] >= -100) & (df['difference_f'] < -50)].max())
    print()
    print("Difference F Between (-)150 - (-)100")
    print("-------------------------------------------")
    print("Length")
    display(len(df[(df['difference_f'] >= -150) & (df['difference_f'] < -100)]))
    print()
    print("Median")
    display(df[(df['difference_f'] >= -150) & (df['difference_f'] < -100)].median())
    print()
    print("Mean")
    display(df[(df['difference_f'] >= -150) & (df['difference_f'] < -100)].mean())
    print()
    print("Min")
    display(df[(df['difference_f'] >= -150) & (df['difference_f'] < -100)].min())
    print()
    print("Max")
    display(df[(df['difference_f'] >= -150) & (df['difference_f'] < -100)].max())
    print()
    print("Difference F Between (-)220 - (-)150")
    print("-------------------------------------------")
    print("Length")
    display(len(df[(df['difference_f'] >= -220) & (df['difference_f'] < -150)]))
    print()
    print("Median")
    display(df[(df['difference_f'] >= -220) & (df['difference_f'] < -150)].median())
    print()
    print("Mean")
    display(df[(df['difference_f'] >= -220) & (df['difference_f'] < -150)].mean())
    print()
    print("Min")
    display(df[(df['difference_f'] >= -220) & (df['difference_f'] < -150)].min())
    print()
    print("Max")
    display(df[(df['difference_f'] >= -220) & (df['difference_f'] < -150)].max())
    print()
    print("Difference F < (-)200")
    print("-------------------------------------------")
    print("Length")
    display(len(df[df['difference_f'] < -200]))
    print()
    print("Median")
    display(df[df['difference_f'] < -200].median())
    print()
    print("Mean")
    display(df[df['difference_f'] < -200].mean())
    print()
    print("Min")
    display(df[df['difference_f'] < -200].min())
    print()
    print("Max")
    display(df[df['difference_f'] < -200].max())
    print()
    display(df[df['difference_f'] ==0])
    print()
    print()
    print()
    print()
    print("Difference F > 0")
    print("-------------------------------------------")
    print("Length")
    display(len(df[df['difference_f'] > 0]))
    print()
    print("Median")
    display(df[df['difference_f'] > 0].median())
    print()
    print("Mean")
    display(df[df['difference_f'] > 0].mean())
    print()
    print("Min")
    display(df[df['difference_f'] > 0].min())
    print()
    print("Max")
    display(df[df['difference_f'] > 0].max())
    print()
    print()
    print("Difference F Between (-)50 - 0")
    print("-------------------------------------------")
    print("Length")
    display(len(df[(df['difference_f'] > -50) & (df['difference_f'] <= 0)]))
    print()
    print("Median")
    display(df[(df['difference_f'] > -50) & (df['difference_f'] <= 0)].median())
    print()
    print("Mean")
    display(df[(df['difference_f'] > -50) & (df['difference_f'] <= 0)].mean())
    print()
    print("Min")
    display(df[(df['difference_f'] > -50) & (df['difference_f'] <= 0)].min())
    print()
    print("Max")
    display(df[(df['difference_f'] > -50) & (df['difference_f'] <= 0)].max())
    print()
    print()
    print("Difference F < (-)50")
    print("-------------------------------------------")
    print("Length")
    display(len(df[df['difference_f'] <= -50]))
    print()
    print("Median")
    display(df[df['difference_f'] <= -50].median())
    print()
    print("Mean")
    display(df[df['difference_f'] <= -50].mean())
    print()
    print("Min")
    display(df[df['difference_f'] <= 50].min())
    print()
    print("Max")
    display(df[df['difference_f'] <= -50].max())
In [53]:
# All the missing rougher.state.floatbank10_e_air are between 844 and 856 in rougher.state.floatbank10_f_air (except 2)

r_state_fb = gold_train_new2[['rougher.state.floatbank10_e_air','rougher.state.floatbank10_f_air']].copy()
r_state_fb_diff = gold_train_new2['rougher.state.floatbank10_f_air'] - gold_train_new2['rougher.state.floatbank10_e_air']
r_state_fb["difference_f_e"] = r_state_fb_diff
r_e_nan = r_state_fb[r_state_fb['rougher.state.floatbank10_e_air'].isna()]

r_e_nan[r_e_nan['rougher.state.floatbank10_f_air'] < 844]
r_e_nan[(r_e_nan['rougher.state.floatbank10_f_air'] > 844) & (r_e_nan['rougher.state.floatbank10_f_air'] < 856)]
Out[53]:
rougher.state.floatbank10_e_air rougher.state.floatbank10_f_air difference_f_e
11631 NaN 850.264604 NaN
11632 NaN 850.386474 NaN
11633 NaN 850.046882 NaN
11634 NaN 849.081928 NaN
11635 NaN 849.264916 NaN
... ... ... ...
12635 NaN 850.469174 NaN
12636 NaN 850.313864 NaN
12637 NaN 851.671104 NaN
12638 NaN 850.454909 NaN
12641 NaN 847.534213 NaN

506 rows × 3 columns

In [54]:
# Look at floatbank_f_air values between 844 - 856

r_state_fb_f = gold_train_new2[['rougher.state.floatbank10_a_air', 'rougher.state.floatbank10_b_air', 
                                'rougher.state.floatbank10_c_air', 'rougher.state.floatbank10_d_air','rougher.state.floatbank10_e_air',
                                'rougher.state.floatbank10_f_air']].copy()
r_state_fb_f = r_state_fb_f[(r_state_fb_f['rougher.state.floatbank10_f_air'] > 844) & 
    (r_state_fb_f['rougher.state.floatbank10_f_air'] < 856)]

r_state_fb_nan(r_state_fb_f)

r_state_fb
Difference Metrics (fb10_e_air & fb10_f_air)
-------------------------------------------
Length
1060
Median
0.030703981591784668
Mean
-25.26861356855608
Std
129.6028648493297
Min
-1072.1726457897248
Max
302.43330533676556

Metrics Difference: Categorical
-------------------------------------------

Difference F > 5
-------------------------------------------
Length
2
Median
rougher.state.floatbank10_e_air    560.396474
rougher.state.floatbank10_f_air    850.177099
difference_f                       289.780626
dtype: float64
Mean
rougher.state.floatbank10_e_air    560.396474
rougher.state.floatbank10_f_air    850.177099
difference_f                       289.780626
dtype: float64
Min
rougher.state.floatbank10_e_air    547.540324
rougher.state.floatbank10_f_air    849.973629
difference_f                       277.127946
dtype: float64
Max
rougher.state.floatbank10_e_air    573.252623
rougher.state.floatbank10_f_air    850.380569
difference_f                       302.433305
dtype: float64
Difference F Between 1 - 5
-------------------------------------------
Length
27
Median
rougher.state.floatbank10_e_air    849.528393
rougher.state.floatbank10_f_air    850.760601
difference_f                         1.314547
dtype: float64
Mean
rougher.state.floatbank10_e_air    849.565861
rougher.state.floatbank10_f_air    850.977494
difference_f                         1.411633
dtype: float64
Min
rougher.state.floatbank10_e_air    846.438053
rougher.state.floatbank10_f_air    848.870274
difference_f                         1.013688
dtype: float64
Max
rougher.state.floatbank10_e_air    852.565098
rougher.state.floatbank10_f_air    854.419003
difference_f                         2.771865
dtype: float64
Difference F Between 0 - 1
-------------------------------------------
Length
257
Median
rougher.state.floatbank10_e_air    849.779730
rougher.state.floatbank10_f_air    850.157957
difference_f                         0.332039
dtype: float64
Mean
rougher.state.floatbank10_e_air    849.760470
rougher.state.floatbank10_f_air    850.134663
difference_f                         0.374193
dtype: float64
Min
rougher.state.floatbank10_e_air    845.426450
rougher.state.floatbank10_f_air    845.446025
difference_f                         0.002805
dtype: float64
Max
rougher.state.floatbank10_e_air    854.668426
rougher.state.floatbank10_f_air    855.604359
difference_f                         0.999571
dtype: float64
Difference F Between (-)1 - 0
-------------------------------------------
Length
215
Median
rougher.state.floatbank10_e_air    850.287039
rougher.state.floatbank10_f_air    849.856819
difference_f                        -0.346674
dtype: float64
Mean
rougher.state.floatbank10_e_air    850.240513
rougher.state.floatbank10_f_air    849.874191
difference_f                        -0.366322
dtype: float64
Min
rougher.state.floatbank10_e_air    845.259282
rougher.state.floatbank10_f_air    844.795222
difference_f                        -0.999865
dtype: float64
Max
rougher.state.floatbank10_e_air    853.224909
rougher.state.floatbank10_f_air    852.896993
difference_f                        -0.004107
dtype: float64
Difference F Between (-)5 - (-)1
-------------------------------------------
Length
24
Median
rougher.state.floatbank10_e_air    851.439675
rougher.state.floatbank10_f_air    850.059194
difference_f                        -1.301314
dtype: float64
Mean
rougher.state.floatbank10_e_air    851.520245
rougher.state.floatbank10_f_air    849.902937
difference_f                        -1.617308
dtype: float64
Min
rougher.state.floatbank10_e_air    848.338074
rougher.state.floatbank10_f_air    845.838617
difference_f                        -4.909628
dtype: float64
Max
rougher.state.floatbank10_e_air    855.630225
rougher.state.floatbank10_f_air    852.509530
difference_f                        -1.008497
dtype: float64
Difference F Between (-)50 - (-)5
-------------------------------------------
Length
3
Median
rougher.state.floatbank10_e_air    868.354339
rougher.state.floatbank10_f_air    850.376424
difference_f                       -18.056114
dtype: float64
Mean
rougher.state.floatbank10_e_air    870.150239
rougher.state.floatbank10_f_air    851.789248
difference_f                       -18.360992
dtype: float64
Min
rougher.state.floatbank10_e_air    865.493748
rougher.state.floatbank10_f_air    850.298225
difference_f                       -26.226207
dtype: float64
Max
rougher.state.floatbank10_e_air    876.602631
rougher.state.floatbank10_f_air    854.693094
difference_f                       -10.800654
dtype: float64
Difference F Between (-)100 - (-)50
-------------------------------------------
Length
2
Median
rougher.state.floatbank10_e_air    907.761495
rougher.state.floatbank10_f_air    852.104309
difference_f                       -55.657185
dtype: float64
Mean
rougher.state.floatbank10_e_air    907.761495
rougher.state.floatbank10_f_air    852.104309
difference_f                       -55.657185
dtype: float64
Min
rougher.state.floatbank10_e_air    904.595784
rougher.state.floatbank10_f_air    849.955141
difference_f                       -60.972065
dtype: float64
Max
rougher.state.floatbank10_e_air    910.927206
rougher.state.floatbank10_f_air    854.253478
difference_f                       -50.342306
dtype: float64
Difference F Between (-)150 - (-)100
-------------------------------------------
Length
0
Median
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Mean
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Min
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Max
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Difference F Between (-)220 - (-)150
-------------------------------------------
Length
1
Median
rougher.state.floatbank10_e_air    1004.412752
rougher.state.floatbank10_f_air     850.441813
difference_f                       -153.970939
dtype: float64
Mean
rougher.state.floatbank10_e_air    1004.412752
rougher.state.floatbank10_f_air     850.441813
difference_f                       -153.970939
dtype: float64
Min
rougher.state.floatbank10_e_air    1004.412752
rougher.state.floatbank10_f_air     850.441813
difference_f                       -153.970939
dtype: float64
Max
rougher.state.floatbank10_e_air    1004.412752
rougher.state.floatbank10_f_air     850.441813
difference_f                       -153.970939
dtype: float64
Difference F < (-)200
-------------------------------------------
Length
23
Median
rougher.state.floatbank10_e_air    1502.374479
rougher.state.floatbank10_f_air     849.991429
difference_f                       -652.383050
dtype: float64
Mean
rougher.state.floatbank10_e_air    1470.636072
rougher.state.floatbank10_f_air     849.996424
difference_f                       -620.639648
dtype: float64
Min
rougher.state.floatbank10_e_air    1097.805808
rougher.state.floatbank10_f_air     849.365202
difference_f                      -1072.172646
dtype: float64
Max
rougher.state.floatbank10_e_air    1922.636637
rougher.state.floatbank10_f_air     850.657709
difference_f                       -247.191134
dtype: float64

rougher.state.floatbank10_e_air rougher.state.floatbank10_f_air difference_f



Difference F > 0
-------------------------------------------
Length
286
Median
rougher.state.floatbank10_e_air    849.736625
rougher.state.floatbank10_f_air    850.222866
difference_f                         0.375459
dtype: float64
Mean
rougher.state.floatbank10_e_air    847.718574
rougher.state.floatbank10_f_air    850.214528
difference_f                         2.495954
dtype: float64
Min
rougher.state.floatbank10_e_air    547.540324
rougher.state.floatbank10_f_air    845.446025
difference_f                         0.002805
dtype: float64
Max
rougher.state.floatbank10_e_air    854.668426
rougher.state.floatbank10_f_air    855.604359
difference_f                       302.433305
dtype: float64

Difference F Between (-)50 - 0
-------------------------------------------
Length
242
Median
rougher.state.floatbank10_e_air    850.364442
rougher.state.floatbank10_f_air    849.893534
difference_f                        -0.370930
dtype: float64
Mean
rougher.state.floatbank10_e_air    850.614244
rougher.state.floatbank10_f_air    849.900782
difference_f                        -0.713461
dtype: float64
Min
rougher.state.floatbank10_e_air    845.259282
rougher.state.floatbank10_f_air    844.795222
difference_f                       -26.226207
dtype: float64
Max
rougher.state.floatbank10_e_air    876.602631
rougher.state.floatbank10_f_air    854.693094
difference_f                        -0.004107
dtype: float64

Difference F < (-)50
-------------------------------------------
Length
26
Median
rougher.state.floatbank10_e_air    1499.135928
rougher.state.floatbank10_f_air     850.036561
difference_f                       -649.239148
dtype: float64
Mean
rougher.state.floatbank10_e_air    1409.406362
rougher.state.floatbank10_f_air     850.175700
difference_f                       -559.230662
dtype: float64
Min
rougher.state.floatbank10_e_air     845.259282
rougher.state.floatbank10_f_air     844.795222
difference_f                      -1072.172646
dtype: float64
Max
rougher.state.floatbank10_e_air    1922.636637
rougher.state.floatbank10_f_air     854.253478
difference_f                        -50.342306
dtype: float64
Out[54]:
rougher.state.floatbank10_e_air rougher.state.floatbank10_f_air difference_f_e
0 1404.472046 1416.354980 11.882935
1 1399.227084 1399.719514 0.492430
2 1399.180945 1400.316682 1.135737
3 1400.943157 1400.234743 -0.708414
4 1401.560902 1401.160227 -0.400675
... ... ... ...
16855 849.664935 849.758091 0.093156
16856 848.515225 850.013123 1.497898
16857 849.016017 850.455635 1.439618
16858 851.589767 851.345606 -0.244161
16859 849.441918 850.112246 0.670328

14855 rows × 3 columns

In [55]:
# Look at floatbank_f_air values between 844 - 856

r_state_fb_e = gold_train_new2[['rougher.state.floatbank10_a_air', 'rougher.state.floatbank10_b_air', 
                                'rougher.state.floatbank10_c_air', 'rougher.state.floatbank10_d_air','rougher.state.floatbank10_e_air',
                                'rougher.state.floatbank10_f_air']].copy()
r_state_fb_e = r_state_fb_e[(r_state_fb_e['rougher.state.floatbank10_e_air'] > 844) & 
    (r_state_fb_f['rougher.state.floatbank10_e_air'] < 856)]

r_state_fb_nan(r_state_fb_e)

r_state_fb_e
Difference Metrics (fb10_e_air & fb10_f_air)
-------------------------------------------
Length
523
Median
0.06080069731820004
Mean
0.03194460157707011
Std
0.6767217288758983
Min
-4.909628087496003
Max
2.771865353207545

Metrics Difference: Categorical
-------------------------------------------

Difference F > 5
-------------------------------------------
Length
0
Median
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Mean
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Min
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Max
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Difference F Between 1 - 5
-------------------------------------------
Length
27
Median
rougher.state.floatbank10_e_air    849.528393
rougher.state.floatbank10_f_air    850.760601
difference_f                         1.314547
dtype: float64
Mean
rougher.state.floatbank10_e_air    849.565861
rougher.state.floatbank10_f_air    850.977494
difference_f                         1.411633
dtype: float64
Min
rougher.state.floatbank10_e_air    846.438053
rougher.state.floatbank10_f_air    848.870274
difference_f                         1.013688
dtype: float64
Max
rougher.state.floatbank10_e_air    852.565098
rougher.state.floatbank10_f_air    854.419003
difference_f                         2.771865
dtype: float64
Difference F Between 0 - 1
-------------------------------------------
Length
257
Median
rougher.state.floatbank10_e_air    849.779730
rougher.state.floatbank10_f_air    850.157957
difference_f                         0.332039
dtype: float64
Mean
rougher.state.floatbank10_e_air    849.760470
rougher.state.floatbank10_f_air    850.134663
difference_f                         0.374193
dtype: float64
Min
rougher.state.floatbank10_e_air    845.426450
rougher.state.floatbank10_f_air    845.446025
difference_f                         0.002805
dtype: float64
Max
rougher.state.floatbank10_e_air    854.668426
rougher.state.floatbank10_f_air    855.604359
difference_f                         0.999571
dtype: float64
Difference F Between (-)1 - 0
-------------------------------------------
Length
215
Median
rougher.state.floatbank10_e_air    850.287039
rougher.state.floatbank10_f_air    849.856819
difference_f                        -0.346674
dtype: float64
Mean
rougher.state.floatbank10_e_air    850.240513
rougher.state.floatbank10_f_air    849.874191
difference_f                        -0.366322
dtype: float64
Min
rougher.state.floatbank10_e_air    845.259282
rougher.state.floatbank10_f_air    844.795222
difference_f                        -0.999865
dtype: float64
Max
rougher.state.floatbank10_e_air    853.224909
rougher.state.floatbank10_f_air    852.896993
difference_f                        -0.004107
dtype: float64
Difference F Between (-)5 - (-)1
-------------------------------------------
Length
24
Median
rougher.state.floatbank10_e_air    851.439675
rougher.state.floatbank10_f_air    850.059194
difference_f                        -1.301314
dtype: float64
Mean
rougher.state.floatbank10_e_air    851.520245
rougher.state.floatbank10_f_air    849.902937
difference_f                        -1.617308
dtype: float64
Min
rougher.state.floatbank10_e_air    848.338074
rougher.state.floatbank10_f_air    845.838617
difference_f                        -4.909628
dtype: float64
Max
rougher.state.floatbank10_e_air    855.630225
rougher.state.floatbank10_f_air    852.509530
difference_f                        -1.008497
dtype: float64
Difference F Between (-)50 - (-)5
-------------------------------------------
Length
0
Median
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Mean
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Min
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Max
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Difference F Between (-)100 - (-)50
-------------------------------------------
Length
0
Median
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Mean
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Min
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Max
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Difference F Between (-)150 - (-)100
-------------------------------------------
Length
0
Median
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Mean
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Min
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Max
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Difference F Between (-)220 - (-)150
-------------------------------------------
Length
0
Median
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Mean
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Min
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Max
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Difference F < (-)200
-------------------------------------------
Length
0
Median
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Mean
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Min
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Max
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64

rougher.state.floatbank10_e_air rougher.state.floatbank10_f_air difference_f



Difference F > 0
-------------------------------------------
Length
284
Median
rougher.state.floatbank10_e_air    849.741162
rougher.state.floatbank10_f_air    850.222866
difference_f                         0.374104
dtype: float64
Mean
rougher.state.floatbank10_e_air    849.741969
rougher.state.floatbank10_f_air    850.214792
difference_f                         0.472823
dtype: float64
Min
rougher.state.floatbank10_e_air    845.426450
rougher.state.floatbank10_f_air    845.446025
difference_f                         0.002805
dtype: float64
Max
rougher.state.floatbank10_e_air    854.668426
rougher.state.floatbank10_f_air    855.604359
difference_f                         2.771865
dtype: float64

Difference F Between (-)50 - 0
-------------------------------------------
Length
239
Median
rougher.state.floatbank10_e_air    850.352184
rougher.state.floatbank10_f_air    849.878683
difference_f                        -0.368203
dtype: float64
Mean
rougher.state.floatbank10_e_air    850.369022
rougher.state.floatbank10_f_air    849.877078
difference_f                        -0.491944
dtype: float64
Min
rougher.state.floatbank10_e_air    845.259282
rougher.state.floatbank10_f_air    844.795222
difference_f                        -4.909628
dtype: float64
Max
rougher.state.floatbank10_e_air    855.630225
rougher.state.floatbank10_f_air    852.896993
difference_f                        -0.004107
dtype: float64

Difference F < (-)50
-------------------------------------------
Length
0
Median
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Mean
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Min
rougher.state.floatbank10_e_air    845.259282
rougher.state.floatbank10_f_air    844.795222
difference_f                        -4.909628
dtype: float64
Max
rougher.state.floatbank10_e_air   NaN
rougher.state.floatbank10_f_air   NaN
difference_f                      NaN
dtype: float64
Out[55]:
rougher.state.floatbank10_a_air rougher.state.floatbank10_b_air rougher.state.floatbank10_c_air rougher.state.floatbank10_d_air rougher.state.floatbank10_e_air rougher.state.floatbank10_f_air
3410 999.972860 1049.828755 1049.518810 1059.678786 849.919943 850.299052
3505 999.749737 1045.861836 1046.091331 1032.660484 849.628940 848.677831
3506 1000.304544 1049.983603 1050.079753 1049.671111 849.502249 849.912654
3507 999.986196 1050.291601 1050.550453 1050.436171 850.165086 850.356585
3508 999.705869 1049.992500 1050.238106 1049.941949 849.619249 850.391474
... ... ... ... ... ... ...
16855 1199.245914 1149.807890 1047.963596 946.640977 849.664935 849.758091
16856 1196.569267 1147.675196 1048.565741 949.773589 848.515225 850.013123
16857 1204.866639 1149.942902 1049.604390 952.702732 849.016017 850.455635
16858 1201.904177 1154.087804 1054.009756 944.138793 851.589767 851.345606
16859 1196.238112 1147.248241 1047.279065 948.756608 849.441918 850.112246

523 rows × 6 columns

In [56]:
# Remove the NaN values where fb_10_f_air is not between 844 - 856 (Training Set)
drop_rows = gold_train_new2[(gold_train_new2['rougher.state.floatbank10_f_air'] < 844) & (gold_train_new2['rougher.state.floatbank10_e_air'].isna())]
gold_train_new2 = gold_train_new2.drop(drop_rows.index)
gold_train_new2
Out[56]:
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 70.541216 127.092003 10.128295 7.25 0.988759 1549.775757 -498.912140 1551.434204 -516.403442 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 69.266198 125.629232 10.296251 7.25 1.002663 1576.166671 -500.904965 1575.950626 -499.865889 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 68.116445 123.819808 11.316280 7.25 0.991265 1601.556163 -499.997791 1600.386685 -500.607762 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 68.347543 122.270188 11.322140 7.25 0.996739 1599.968720 -500.951778 1600.659236 -499.677094 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 66.927016 117.988169 11.913613 7.25 1.009869 1601.339707 -498.975456 1601.437854 -500.323246 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16855 2018-08-18 06:59:59 73.755150 123.381787 8.028927 6.50 1.304232 1648.421193 -400.382169 1648.742005 -400.359661 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
16856 2018-08-18 07:59:59 69.049291 120.878188 7.962636 6.50 1.302419 1649.820162 -399.930973 1649.357538 -399.721222 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
16857 2018-08-18 08:59:59 67.002189 105.666118 7.955111 6.50 1.315926 1649.166761 -399.888631 1649.196904 -399.677571 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
16858 2018-08-18 09:59:59 65.523246 98.880538 7.984164 6.50 1.241969 1646.547763 -398.977083 1648.212240 -400.383265 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
16859 2018-08-18 10:59:59 70.281454 95.248427 8.078957 6.50 1.283045 1648.759906 -399.862053 1650.135395 -399.957321 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

14853 rows × 54 columns

In [57]:
# Replace the NaN values for fb_10_e_air with the values from fb_10_f_air (Training Set)

# Get the rows where fb10_f_air is between 844 - 856 and where 10_e_air is NaN
replace_values = gold_train_new2[(gold_train_new2['rougher.state.floatbank10_f_air'] > 844) & 
    (gold_train_new2['rougher.state.floatbank10_f_air'] < 856) & (gold_train_new2['rougher.state.floatbank10_e_air'].isna())]

# Separate the values from fb10_f_air
replace_values = replace_values['rougher.state.floatbank10_f_air']
replace_values

# Isolate the index
replace_values_index = replace_values.index

# Input fb10_f_air values where fb10_e_air values are NaN
gold_train_new2.loc[replace_values_index,['rougher.state.floatbank10_e_air']] = replace_values.values
In [58]:
# Remove the NaN values where fb_10_f_air is not between 844 - 856 (Full Set)

drop_rows_full = gold_full_new2[(gold_full_new2['rougher.state.floatbank10_f_air'] < 844) 
    & (gold_full_new2['rougher.state.floatbank10_e_air'].isna())]

gold_full_new2 = gold_full_new2.drop(drop_rows_full.index)
gold_full_new2
Out[58]:
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 70.541216 127.092003 10.128295 7.25 0.988759 1549.775757 -498.912140 1551.434204 -516.403442 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 69.266198 125.629232 10.296251 7.25 1.002663 1576.166671 -500.904965 1575.950626 -499.865889 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 68.116445 123.819808 11.316280 7.25 0.991265 1601.556163 -499.997791 1600.386685 -500.607762 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 68.347543 122.270188 11.322140 7.25 0.996739 1599.968720 -500.951778 1600.659236 -499.677094 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 66.927016 117.988169 11.913613 7.25 1.009869 1601.339707 -498.975456 1601.437854 -500.323246 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22711 2018-08-18 06:59:59 73.755150 123.381787 8.028927 6.50 1.304232 1648.421193 -400.382169 1648.742005 -400.359661 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
22712 2018-08-18 07:59:59 69.049291 120.878188 7.962636 6.50 1.302419 1649.820162 -399.930973 1649.357538 -399.721222 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
22713 2018-08-18 08:59:59 67.002189 105.666118 7.955111 6.50 1.315926 1649.166761 -399.888631 1649.196904 -399.677571 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
22714 2018-08-18 09:59:59 65.523246 98.880538 7.984164 6.50 1.241969 1646.547763 -398.977083 1648.212240 -400.383265 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
22715 2018-08-18 10:59:59 70.281454 95.248427 8.078957 6.50 1.283045 1648.759906 -399.862053 1650.135395 -399.957321 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

20224 rows × 54 columns

In [59]:
# The full dataset has the same number of NaN in fb10_e_air that fit the conditions
gold_full_new2[(gold_full_new2['rougher.state.floatbank10_f_air'] > 844) & 
    (gold_full_new2['rougher.state.floatbank10_f_air'] < 856) & (gold_full_new2['rougher.state.floatbank10_e_air'].isna())]
Out[59]:
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
17487 2018-01-12 14:59:59 100.000000 132.381422 8.596242 7.700000 1.128730 1499.732733 -500.124421 1498.597330 -497.323643 ... 20.037523 -503.194662 14.955429 -113.339067 10.957844 -499.813937 9.008137 -499.838607 11.007156 -500.349717
17488 2018-01-12 15:59:59 100.000000 77.158606 5.984331 7.700000 1.110370 1498.554357 -500.030759 1501.211518 -501.018996 ... 20.016512 -511.242934 15.069721 -172.029711 10.937594 -501.654700 9.036555 -500.248965 10.961386 -499.775934
17489 2018-01-12 16:59:59 82.131577 65.587127 6.254088 7.700000 1.104495 1501.375468 -499.350265 1499.672941 -500.009787 ... 19.997452 -704.771475 15.042079 -500.339424 10.984837 -501.967887 9.007186 -500.078269 10.980171 -499.857606
17490 2018-01-12 17:59:59 71.291044 119.490784 8.801517 7.700000 1.097898 1503.969890 -501.809893 1499.829552 -503.107773 ... 20.001198 -699.680112 15.075783 -499.620904 10.885297 -499.511380 9.000068 -499.740706 10.983720 -499.377833
17491 2018-01-12 18:59:59 73.616831 161.577497 10.027359 7.700000 1.091330 1385.166020 -497.876926 1386.391221 -499.180635 ... 20.059205 -569.402692 14.966746 -499.763469 10.979059 -500.057699 9.053582 -500.115290 11.008820 -500.411677
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18491 2018-02-23 10:59:59 77.625242 153.515136 8.991989 6.970000 1.656174 1695.738082 -499.793923 1701.787220 -501.421545 ... 20.008040 -499.529951 15.021566 -499.590952 10.931721 -499.648540 9.016739 -500.025299 11.018518 -499.315250
18492 2018-02-23 11:59:59 78.416618 172.661362 9.013515 6.970000 1.730239 1699.017072 -499.871292 1699.674112 -501.113629 ... 20.006195 -499.702420 15.084852 -500.487182 10.990386 -499.785935 9.025500 -499.875635 11.003342 -498.872649
18493 2018-02-23 12:59:59 76.979071 180.495838 8.974494 7.699999 1.586058 1700.326082 -499.789732 1699.412466 -499.909644 ... 19.969979 -498.682617 15.027391 -499.613782 11.027269 -499.209635 8.962761 -499.902900 10.973810 -499.638898
18494 2018-02-23 13:59:59 74.979987 184.732692 8.975302 7.700000 1.517210 1698.170933 -499.132089 1697.989221 -500.116017 ... 20.001680 -500.716908 14.979438 -500.041012 10.948832 -499.622619 9.013544 -499.999757 10.993802 -499.471426
18497 2018-02-23 16:59:59 100.000000 NaN NaN 7.620000 0.013963 1697.286812 -500.963985 1698.993258 -500.749218 ... 20.040100 -498.606313 15.010248 -499.912705 10.955332 -498.765811 9.084368 -499.963871 11.006955 -500.226298

506 rows × 54 columns

In [60]:
# Replace the NaN values for fb_10_e_air with the values from fb_10_f_air (Full Set)

# Get the rows where fb10_f_air is between 844 - 856 and where 10_e_air is NaN
replace_values_full = gold_full_new2[(gold_full_new2['rougher.state.floatbank10_f_air'] > 844) & 
    (gold_full_new2['rougher.state.floatbank10_f_air'] < 856) & (gold_full_new2['rougher.state.floatbank10_e_air'].isna())]

# Separate the values from fb10_f_air
replace_values_full = replace_values_full['rougher.state.floatbank10_f_air']
replace_values_full

# Isolate the index
replace_values_full_index = replace_values_full.index

# Input fb10_f_air values where fb10_e_air values are NaN
gold_full_new2.loc[replace_values_full_index,['rougher.state.floatbank10_e_air']] = replace_values_full.values
In [61]:
# Replace the NaN values for fb_10_e_air with the values from fb_10_f_air (Test Set)

# Get the rows where fb10_f_air is between 844 - 856 and where 10_e_air is NaN
replace_values_test = gold_test_new2[(gold_test_new2['rougher.state.floatbank10_f_air'] > 844) & 
    (gold_test_new2['rougher.state.floatbank10_f_air'] < 856) & (gold_test_new2['rougher.state.floatbank10_e_air'].isna())]

# Separate the values from fb10_f_air
replace_values_test = replace_values_test['rougher.state.floatbank10_f_air']
replace_values_test

# Isolate the index
replace_values_test_index = replace_values_test.index

# Input fb10_f_air values where fb10_e_air values are NaN
gold_test_new2.loc[replace_values_test_index,['rougher.state.floatbank10_e_air']] = replace_values_test.values
In [62]:
# Compare histograms

test_hist = gold_train_new2[['rougher.state.floatbank10_f_air','rougher.state.floatbank10_e_air']]

test_hist[test_hist.columns].hist(figsize = [10,4])
plt.show()
No description has been provided for this image

Floatbank Air Difference Analysis (fb10_e_air & fb10_f_air) - Detailed Breakdown

Filtered for floatbank10_f_air between 844-856 to focus on normal operating range for imputation

Overall Difference Metrics: f between 844-856

Metric Value
Length 1,060
Median 0.031
Mean -25.269
Std 129.603
Min -1,072.173
Max 302.433

Categorical Breakdown by Difference Groups: f between 844-856

Group Category Variable Median Mean Min Max
Difference F > 5 (2 observations)
floatbank10_e_air 560.396 560.396 547.540 573.253
floatbank10_f_air 850.177 850.177 849.974 850.381
difference_f 289.781 289.781 277.128 302.433
Difference F Between 1-5 (27 observations)
floatbank10_e_air 849.528 849.566 846.438 852.565
floatbank10_f_air 850.761 850.977 848.870 854.419
difference_f 1.315 1.412 1.014 2.772
Difference F Between 0-1 (257 observations)
floatbank10_e_air 849.780 849.760 845.426 854.668
floatbank10_f_air 850.158 850.135 845.446 855.604
difference_f 0.332 0.374 0.003 1.000
Difference F Between -1 to 0 (215 observations)
floatbank10_e_air 850.287 850.241 845.259 853.225
floatbank10_f_air 849.857 849.874 844.795 852.897
difference_f -0.347 -0.366 -1.000 -0.004
Difference F Between -5 to -1 (24 observations)
floatbank10_e_air 851.440 851.520 848.338 855.630
floatbank10_f_air 850.059 849.903 845.839 852.510
difference_f -1.301 -1.617 -4.910 -1.008
Difference F Between -50 to -5 (3 observations)
floatbank10_e_air 868.354 870.150 865.494 876.603
floatbank10_f_air 850.376 851.789 850.298 854.693
difference_f -18.056 -18.361 -26.226 -10.801
Difference F Between -100 to -50 (2 observations)
floatbank10_e_air 907.761 907.761 904.596 910.927
floatbank10_f_air 852.104 852.104 849.955 854.253
difference_f -55.657 -55.657 -60.972 -50.342
Difference F Between -150 to -100 (0 observations)
floatbank10_e_air N/A N/A N/A N/A
floatbank10_f_air N/A N/A N/A N/A
difference_f N/A N/A N/A N/A
Difference F Between -220 to -150 (1 observation)
floatbank10_e_air 1,004.413 1,004.413 1,004.413 1,004.413
floatbank10_f_air 850.442 850.442 850.442 850.442
difference_f -153.971 -153.971 -153.971 -153.971
Difference F < -200 (23 observations)
floatbank10_e_air 1,502.374 1,470.636 1,097.806 1,922.637
floatbank10_f_air 849.991 849.996 849.365 850.658
difference_f -652.383 -620.640 -1,072.173 -247.191

Overall Difference Metrics: e between 844-856

Metric Value
Length 523
Median 0.061
Mean 0.032
Std 0.677
Min -4.910
Max 2.772

Categorical Breakdown by Difference Groups: e between 844-856

Group Category Variable Median Mean Min Max
Difference F > 0 (284 observations)
floatbank10_e_air 849.741 849.742 845.426 854.668
floatbank10_f_air 850.223 850.215 845.446 855.604
difference_f 0.374 0.473 0.003 2.772
Difference F Between -50 to 0 (239 observations)
floatbank10_e_air 850.352 850.369 845.259 855.630
floatbank10_f_air 849.879 849.877 844.795 852.897
difference_f -0.368 -0.492 -4.910 -0.004
Difference F < -50 (0 observations)
floatbank10_e_air N/A N/A N/A N/A
floatbank10_f_air N/A N/A N/A N/A
difference_f N/A N/A N/A N/A

Key Observations for Imputation Strategy

Filtered Dataset Results (Both e_air and f_air between 844-856):

  • Dataset size reduced: From 1,060 to 523 observations after filtering both variables to normal operating range
  • All extreme outlier categories eliminated: No observations in categories beyond -5 to +5 difference range
  • Improved statistics: Standard deviation dropped from 129.603 to 0.677, mean shifted from -25.269 to 0.032
  • Only normal operating differences remain: 523 observations distributed across -5 to +2.8 range

Distribution in filtered dataset:

  • Difference F > 0: 284 observations (54.3%)
  • Difference F Between -5 to 0: 239 observations (45.7%)
  • All extreme categories (< -5 or > 5): 0 observations

Imputation Strategy Validation:

  • Median difference: 0.061 (very close to 0)
  • Mean difference: 0.032 (very close to 0)
  • Range: -4.91 to +2.77 (all within reasonable sensor variance)

Conclusion:

Filtering both variables to the 844-856 range successfully isolates normal operating conditions. The relationship floatbank10_e_air ≈ floatbank10_f_air (difference ≈ 0) is strongly validated for imputation in this range. Using floatbank10_e_air = floatbank10_f_air + 0.06 or simply floatbank10_e_air = floatbank10_f_air is well-justified for missing values within the normal operating range.

Therefore, reasonable to drop the 2 NaN values where f is not in the 844 - 856 range and fill the rest of the floatbank_e_air NaN values to the same values as floatbank_f_air.

In [63]:
# View the missing data

gold_train_new2.isna().sum()
Out[63]:
date                                            0
final.output.recovery                           0
primary_cleaner.input.sulfate                 259
primary_cleaner.input.depressant              189
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                190
primary_cleaner.state.floatbank8_a_air          0
primary_cleaner.state.floatbank8_a_level        0
primary_cleaner.state.floatbank8_b_air          0
primary_cleaner.state.floatbank8_b_level        0
primary_cleaner.state.floatbank8_c_air          0
primary_cleaner.state.floatbank8_c_level        0
primary_cleaner.state.floatbank8_d_air          0
primary_cleaner.state.floatbank8_d_level        0
rougher.input.feed_ag                           0
rougher.input.feed_pb                           0
rougher.input.feed_rate                       183
rougher.input.feed_size                         0
rougher.input.feed_sol                          0
rougher.input.feed_au                           0
rougher.input.floatbank10_sulfate             257
rougher.input.floatbank10_xanthate              0
rougher.input.floatbank11_sulfate             262
rougher.input.floatbank11_xanthate              0
rougher.state.floatbank10_a_air                 0
rougher.state.floatbank10_a_level               0
rougher.state.floatbank10_b_air                 0
rougher.state.floatbank10_b_level               0
rougher.state.floatbank10_c_air                 0
rougher.state.floatbank10_c_level               0
rougher.state.floatbank10_d_air                 0
rougher.state.floatbank10_d_level               0
rougher.state.floatbank10_e_air                 0
rougher.state.floatbank10_e_level               0
rougher.state.floatbank10_f_air                 0
rougher.state.floatbank10_f_level               0
secondary_cleaner.state.floatbank2_a_air      220
secondary_cleaner.state.floatbank2_a_level      0
secondary_cleaner.state.floatbank2_b_air        0
secondary_cleaner.state.floatbank2_b_level      0
secondary_cleaner.state.floatbank3_a_air        0
secondary_cleaner.state.floatbank3_a_level      0
secondary_cleaner.state.floatbank3_b_air        0
secondary_cleaner.state.floatbank3_b_level      0
secondary_cleaner.state.floatbank4_a_air        0
secondary_cleaner.state.floatbank4_a_level      0
secondary_cleaner.state.floatbank4_b_air        0
secondary_cleaner.state.floatbank4_b_level      0
secondary_cleaner.state.floatbank5_a_air        0
secondary_cleaner.state.floatbank5_a_level      0
secondary_cleaner.state.floatbank5_b_air        0
secondary_cleaner.state.floatbank5_b_level      0
secondary_cleaner.state.floatbank6_a_air        0
secondary_cleaner.state.floatbank6_a_level      0
dtype: int64
In [64]:
# Compare Histograms for rougher.input.floatbank11_sulfate and rougher.input.floatbank10_sulfate

sulf_hist = gold_train_new2[['rougher.input.floatbank11_sulfate','rougher.input.floatbank10_sulfate']]

sulf_hist[sulf_hist.columns].hist(figsize = [10,6])
Out[64]:
array([[<AxesSubplot:title={'center':'rougher.input.floatbank11_sulfate'}>,
        <AxesSubplot:title={'center':'rougher.input.floatbank10_sulfate'}>]],
      dtype=object)
No description has been provided for this image
In [65]:
# Understand the data

sulfate_train = sulf_hist.copy()
sulfate_train

len(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'].isna()) & (sulfate_train['rougher.input.floatbank11_sulfate'] < 5)])


sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'].isna()) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5) & (sulfate_train['rougher.input.floatbank11_sulfate'] < 15)]


sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'].isna()) & (sulfate_train['rougher.input.floatbank11_sulfate'] < 0)]

sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'].isna()) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 15) & (sulfate_train['rougher.input.floatbank11_sulfate'] < 25)]

sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'].isna()) & (sulfate_train['rougher.input.floatbank11_sulfate'] > 25 )]


sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'].isna()) & (sulfate_train['rougher.input.floatbank11_sulfate'] < 1)]
Out[65]:
rougher.input.floatbank11_sulfate rougher.input.floatbank10_sulfate
10808 0.012598 NaN
12568 0.147089 NaN
12569 0.024857 NaN
12641 0.018676 NaN
12642 0.013386 NaN
13460 0.000969 NaN
13461 0.000608 NaN
13464 0.007027 NaN
13466 0.003277 NaN
16053 0.215863 NaN
16210 0.000399 NaN
16211 0.002165 NaN
16271 0.148911 NaN
16272 0.215205 NaN
16273 0.216342 NaN
16274 0.178477 NaN
16275 0.169855 NaN
16276 0.160534 NaN
16277 0.034283 NaN
16607 0.328736 NaN
16608 0.297620 NaN
16610 0.259913 NaN
In [66]:
# Continue to understand the data

sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'].isna()) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] >= 0) & (sulfate_train['rougher.input.floatbank10_sulfate'] <= 1)]


sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'].notna()) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] >= 0) & (sulfate_train['rougher.input.floatbank10_sulfate'] <= 1)]


sulfate_train['sulfate_10_minus_11'] = sulfate_train['rougher.input.floatbank10_sulfate'] - sulfate_train['rougher.input.floatbank11_sulfate']

sulfate_train[sulfate_train['sulfate_10_minus_11'] < 1]
Out[66]:
rougher.input.floatbank11_sulfate rougher.input.floatbank10_sulfate sulfate_10_minus_11
0 11.836743 11.986616 0.149873
1 11.996163 11.971193 -0.024970
2 11.920305 11.920603 0.000298
3 11.692450 11.630094 -0.062356
4 10.960521 10.957755 -0.002766
... ... ... ...
16855 7.766744 7.762770 -0.003974
16856 7.095508 7.356687 0.261179
16857 6.584130 6.586020 0.001890
16858 6.209517 6.210119 0.000602
16859 6.168939 6.146982 -0.021957

14094 rows × 3 columns

In [67]:
# When fb11_sulfate between 5.9 - 13.1; the difference tends to be extremely similar
sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_10_minus_11'] > 1)]
Out[67]:
rougher.input.floatbank11_sulfate rougher.input.floatbank10_sulfate sulfate_10_minus_11
2917 7.996953 9.000358 1.003405
3104 7.607556 14.887618 7.280063
5399 6.321435 8.494991 2.173556
5630 8.630431 9.997678 1.367247
5806 9.974080 11.012910 1.038831
7912 6.600964 10.359296 3.758332
8626 7.337104 9.769538 2.432434
8864 7.470518 9.497716 2.027198
9910 10.127905 14.000471 3.872566
10184 6.677624 10.046377 3.368753
10239 11.343445 15.743700 4.400254
15221 9.882943 11.069928 1.186984
15222 8.822537 11.805334 2.982797
15223 9.865506 11.442276 1.576769
In [68]:
# rougher.input.floatbank11_sulfate & rougher.input.floatbank10_sulfate comparison
sulfate_train = gold_train_new2.filter(like='_sulfate', axis = 1).copy()
sulfate_difference_train = sulfate_train['rougher.input.floatbank10_sulfate'] - sulfate_train['rougher.input.floatbank11_sulfate']
sulfate_train['sulfate_difference'] = sulfate_difference_train

print("DataFrame")
display(sulfate_train)
print()
print("Median")
display(sulfate_train.median())
print()
print("Mean")
display(sulfate_train.mean())
print()
print("Min")
display(sulfate_train.min())
print()
print("Max")
display(sulfate_train.max())
print()
print("Length")
display(len(sulfate_train))
print()
print()




print(f"Sulfate 11 between 5.9 - 13.1 & Differences Greater Than 1")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] > 1)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] > 1)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] > 1)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] > 1)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] > 1)]))
print()
print()


print(f"Sulfate 11 between 5.9 - 13.1 & Difference Between 0 - 1")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] > 0) & 
    (sulfate_train['sulfate_difference'] < 1)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] > 0) 
    & (sulfate_train['sulfate_difference'] < 1)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] > 0) & 
    (sulfate_train['sulfate_difference'] < 1)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] > 0) 
    & (sulfate_train['sulfate_difference'] < 1)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] > 0) & 
    (sulfate_train['sulfate_difference'] < 1)]))
print()
print()

print(f"Sulfate 11 between 5.9 - 13.1 & Difference Between 0 - (-)1")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] < 0) & 
    (sulfate_train['sulfate_difference'] > -1)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] < 0) 
    & (sulfate_train['sulfate_difference'] > -1)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] < 0) & 
    (sulfate_train['sulfate_difference'] > -1)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] < 0) 
    & (sulfate_train['sulfate_difference'] > -1)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] < 0) & 
    (sulfate_train['sulfate_difference'] > -1)]))
print()
print()



print(f"Sulfate 11 between 5.9 - 13.1 & Differences Less Than -1")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] < -1)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] < -1)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] < -1)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] < -1)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 5.9) & (sulfate_train['sulfate_difference'] < -1)]))
print()
print()


print(f"Sulfate 11 Less Than 1 & Differences Greater Than 1")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 1)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 1)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 1)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 1)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 1)]))
print()
print()

print(f"Sulfate 11 Less Than 1 & Difference Between 10 - 15")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 10) & 
    (sulfate_train['sulfate_difference'] < 15)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 10) 
    & (sulfate_train['sulfate_difference'] < 15)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 10) & 
    (sulfate_train['sulfate_difference'] < 15)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 10) 
    & (sulfate_train['sulfate_difference'] < 15)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 10) & 
    (sulfate_train['sulfate_difference'] < 15)]))
print()
print()

print(f"Sulfate 11 Less Than 1 & Difference Between 0 - 1")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 0) & 
    (sulfate_train['sulfate_difference'] < 1)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 0) 
    & (sulfate_train['sulfate_difference'] < 1)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 0) & 
    (sulfate_train['sulfate_difference'] < 1)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 0) 
    & (sulfate_train['sulfate_difference'] < 1)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] > 0) & 
    (sulfate_train['sulfate_difference'] < 1)]))
print()
print()

print(f"Sulfate 11 Less Than 1 & Difference Between 0 - (-)1")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] < 0) & 
    (sulfate_train['sulfate_difference'] > -1)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] < 0) 
    & (sulfate_train['sulfate_difference'] > -1)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] < 0) & 
    (sulfate_train['sulfate_difference'] > -1)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] < 0) 
    & (sulfate_train['sulfate_difference'] > -1)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] < 0) & 
    (sulfate_train['sulfate_difference'] > -1)]))
print()
print()



print(f"Sulfate 11 Less Than 1 & Differences Less Than -1")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] < -1)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] < -1)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] < -1)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] < -1)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 1) & (sulfate_train['sulfate_difference'] < -1)]))
print()
print()
DataFrame
rougher.input.floatbank10_sulfate rougher.input.floatbank11_sulfate sulfate_difference
0 11.986616 11.836743 0.149873
1 11.971193 11.996163 -0.024970
2 11.920603 11.920305 0.000298
3 11.630094 11.692450 -0.062356
4 10.957755 10.960521 -0.002766
... ... ... ...
16855 7.762770 7.766744 -0.003974
16856 7.356687 7.095508 0.261179
16857 6.586020 6.584130 0.001890
16858 6.210119 6.209517 0.000602
16859 6.146982 6.168939 -0.021957

14853 rows × 3 columns

Median
rougher.input.floatbank10_sulfate    11.708082
rougher.input.floatbank11_sulfate    11.413828
sulfate_difference                    0.000085
dtype: float64
Mean
rougher.input.floatbank10_sulfate    11.763041
rougher.input.floatbank11_sulfate    11.389276
sulfate_difference                    0.360227
dtype: float64
Min
rougher.input.floatbank10_sulfate     0.000044
rougher.input.floatbank11_sulfate     0.000049
sulfate_difference                  -12.977835
dtype: float64
Max
rougher.input.floatbank10_sulfate    36.118275
rougher.input.floatbank11_sulfate    37.980648
sulfate_difference                   23.746576
dtype: float64
Length
14853

Sulfate 11 between 5.9 - 13.1 & Differences Greater Than 1
-----------------------------------
Median
rougher.input.floatbank10_sulfate    10.686103
rougher.input.floatbank11_sulfate     8.313692
sulfate_difference                    2.302995
dtype: float64
Mean
rougher.input.floatbank10_sulfate    11.223442
rougher.input.floatbank11_sulfate     8.475643
sulfate_difference                    2.747799
dtype: float64
Min
rougher.input.floatbank10_sulfate    8.494991
rougher.input.floatbank11_sulfate    6.321435
sulfate_difference                   1.003405
dtype: float64
Max
rougher.input.floatbank10_sulfate    15.743700
rougher.input.floatbank11_sulfate    11.343445
sulfate_difference                    7.280063
dtype: float64
Length
14

Sulfate 11 between 5.9 - 13.1 & Difference Between 0 - 1
-----------------------------------
Median
rougher.input.floatbank10_sulfate    10.696663
rougher.input.floatbank11_sulfate    10.686777
sulfate_difference                    0.002234
dtype: float64
Mean
rougher.input.floatbank10_sulfate    10.443461
rougher.input.floatbank11_sulfate    10.435455
sulfate_difference                    0.008006
dtype: float64
Min
rougher.input.floatbank10_sulfate    5.905364
rougher.input.floatbank11_sulfate    5.905142
sulfate_difference                   0.000002
dtype: float64
Max
rougher.input.floatbank10_sulfate    13.236237
rougher.input.floatbank11_sulfate    13.097828
sulfate_difference                    0.997880
dtype: float64
Length
4945

Sulfate 11 between 5.9 - 13.1 & Difference Between 0 - (-)1
-----------------------------------
Median
rougher.input.floatbank10_sulfate    10.629334
rougher.input.floatbank11_sulfate    10.645323
sulfate_difference                   -0.002251
dtype: float64
Mean
rougher.input.floatbank10_sulfate    10.436288
rougher.input.floatbank11_sulfate    10.443723
sulfate_difference                   -0.007435
dtype: float64
Min
rougher.input.floatbank10_sulfate    5.913521
rougher.input.floatbank11_sulfate    5.917965
sulfate_difference                  -0.998905
dtype: float64
Max
rougher.input.floatbank10_sulfate    13.095482
rougher.input.floatbank11_sulfate    13.096271
sulfate_difference                   -0.000001
dtype: float64
Length
4966

Sulfate 11 between 5.9 - 13.1 & Differences Less Than -1
-----------------------------------
Median
rougher.input.floatbank10_sulfate    5.293542
rougher.input.floatbank11_sulfate    8.681380
sulfate_difference                  -3.120086
dtype: float64
Mean
rougher.input.floatbank10_sulfate    4.918540
rougher.input.floatbank11_sulfate    9.474758
sulfate_difference                  -4.556217
dtype: float64
Min
rougher.input.floatbank10_sulfate     0.001164
rougher.input.floatbank11_sulfate     6.434697
sulfate_difference                  -12.977835
dtype: float64
Max
rougher.input.floatbank10_sulfate    11.597201
rougher.input.floatbank11_sulfate    13.004236
sulfate_difference                   -1.001706
dtype: float64
Length
28

Sulfate 11 Less Than 1 & Differences Greater Than 1
-----------------------------------
Median
rougher.input.floatbank10_sulfate    12.998993
rougher.input.floatbank11_sulfate     0.029289
sulfate_difference                   12.963763
dtype: float64
Mean
rougher.input.floatbank10_sulfate    13.014298
rougher.input.floatbank11_sulfate     0.028336
sulfate_difference                   12.985962
dtype: float64
Min
rougher.input.floatbank10_sulfate    1.239699
rougher.input.floatbank11_sulfate    0.000086
sulfate_difference                   1.049670
dtype: float64
Max
rougher.input.floatbank10_sulfate    23.748453
rougher.input.floatbank11_sulfate     0.240833
sulfate_difference                   23.746576
dtype: float64
Length
402

Sulfate 11 Less Than 1 & Difference Between 10 - 15
-----------------------------------
Median
rougher.input.floatbank10_sulfate    12.998977
rougher.input.floatbank11_sulfate     0.030791
sulfate_difference                   12.963718
dtype: float64
Mean
rougher.input.floatbank10_sulfate    12.713960
rougher.input.floatbank11_sulfate     0.030226
sulfate_difference                   12.683733
dtype: float64
Min
rougher.input.floatbank10_sulfate    10.000831
rougher.input.floatbank11_sulfate     0.000529
sulfate_difference                   10.000302
dtype: float64
Max
rougher.input.floatbank10_sulfate    14.999096
rougher.input.floatbank11_sulfate     0.240833
sulfate_difference                   14.989952
dtype: float64
Length
318

Sulfate 11 Less Than 1 & Difference Between 0 - 1
-----------------------------------
Median
rougher.input.floatbank10_sulfate    0.042575
rougher.input.floatbank11_sulfate    0.012525
sulfate_difference                   0.023018
dtype: float64
Mean
rougher.input.floatbank10_sulfate    0.306711
rougher.input.floatbank11_sulfate    0.220734
sulfate_difference                   0.085977
dtype: float64
Min
rougher.input.floatbank10_sulfate    0.001472
rougher.input.floatbank11_sulfate    0.000049
sulfate_difference                   0.000201
dtype: float64
Max
rougher.input.floatbank10_sulfate    1.352146
rougher.input.floatbank11_sulfate    0.961210
sulfate_difference                   0.390936
dtype: float64
Length
10

Sulfate 11 Less Than 1 & Difference Between 0 - (-)1
-----------------------------------
Median
rougher.input.floatbank10_sulfate    0.009081
rougher.input.floatbank11_sulfate    0.159014
sulfate_difference                  -0.034193
dtype: float64
Mean
rougher.input.floatbank10_sulfate    0.210179
rougher.input.floatbank11_sulfate    0.321202
sulfate_difference                  -0.111023
dtype: float64
Min
rougher.input.floatbank10_sulfate    0.001530
rougher.input.floatbank11_sulfate    0.003947
sulfate_difference                  -0.371064
dtype: float64
Max
rougher.input.floatbank10_sulfate    0.676090
rougher.input.floatbank11_sulfate    0.829769
sulfate_difference                  -0.000815
dtype: float64
Length
11

Sulfate 11 Less Than 1 & Differences Less Than -1
-----------------------------------
Median
rougher.input.floatbank10_sulfate   NaN
rougher.input.floatbank11_sulfate   NaN
sulfate_difference                  NaN
dtype: float64
Mean
rougher.input.floatbank10_sulfate   NaN
rougher.input.floatbank11_sulfate   NaN
sulfate_difference                  NaN
dtype: float64
Min
rougher.input.floatbank10_sulfate   NaN
rougher.input.floatbank11_sulfate   NaN
sulfate_difference                  NaN
dtype: float64
Max
rougher.input.floatbank10_sulfate   NaN
rougher.input.floatbank11_sulfate   NaN
sulfate_difference                  NaN
dtype: float64
Length
0

In [69]:
# Since 99.6 % of the data is ~0 when fb11_sulfate is between 5.9 - 13.1; 
# fill the fb10_sulfate NaN values, when fb11 is in the specified range, to fb11 values

# Training Set

training_values = gold_train_new2[(gold_train_new2['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (gold_train_new2['rougher.input.floatbank11_sulfate'] > 5.9) & (gold_train_new2['rougher.input.floatbank10_sulfate'].isna())]

training_values = training_values['rougher.input.floatbank11_sulfate']
training_values_index = training_values.index

gold_train_new2.loc[training_values_index, ['rougher.input.floatbank10_sulfate']] = training_values

gold_train_new2['rougher.input.floatbank10_sulfate'].isna().sum()

# Full DataSet

full_values = gold_full_new2[(gold_full_new2['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (gold_full_new2['rougher.input.floatbank11_sulfate'] > 5.9) & (gold_full_new2['rougher.input.floatbank10_sulfate'].isna())]

full_values = full_values['rougher.input.floatbank11_sulfate']
full_values_index = full_values.index

gold_full_new2.loc[full_values_index, ['rougher.input.floatbank10_sulfate']] = full_values

gold_full_new2['rougher.input.floatbank10_sulfate'].isna().sum()


# Test Set

test_values = gold_test_new2[(gold_test_new2['rougher.input.floatbank11_sulfate'] < 13.1) & 
    (gold_test_new2['rougher.input.floatbank11_sulfate'] > 5.9) & (gold_test_new2['rougher.input.floatbank10_sulfate'].isna())]

test_values = test_values['rougher.input.floatbank11_sulfate']
test_values_index = test_values.index

gold_test_new2.loc[test_values_index, ['rougher.input.floatbank10_sulfate']] = test_values

gold_test_new2['rougher.input.floatbank10_sulfate'].isna().sum()
Out[69]:
254
In [70]:
# Drop the remaining missing values from fb10_sulfate (Training Set)
drop_train_values = gold_train_new2[gold_train_new2['rougher.input.floatbank10_sulfate'].isna()]

drop_train_values_index = drop_train_values.index

gold_train_new2 = gold_train_new2.drop(index = drop_train_values_index, axis = 1)


# Drop the remaining missing values from fb10_sulfate (Full Set)
drop_full_values = gold_full_new2[gold_full_new2['rougher.input.floatbank10_sulfate'].isna()]

drop_full_values_index = drop_full_values.index

gold_full_new2 = gold_full_new2.drop(index = drop_full_values_index, axis = 1)
gold_full_new2
Out[70]:
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 70.541216 127.092003 10.128295 7.25 0.988759 1549.775757 -498.912140 1551.434204 -516.403442 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 69.266198 125.629232 10.296251 7.25 1.002663 1576.166671 -500.904965 1575.950626 -499.865889 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 68.116445 123.819808 11.316280 7.25 0.991265 1601.556163 -499.997791 1600.386685 -500.607762 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 68.347543 122.270188 11.322140 7.25 0.996739 1599.968720 -500.951778 1600.659236 -499.677094 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 66.927016 117.988169 11.913613 7.25 1.009869 1601.339707 -498.975456 1601.437854 -500.323246 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22711 2018-08-18 06:59:59 73.755150 123.381787 8.028927 6.50 1.304232 1648.421193 -400.382169 1648.742005 -400.359661 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
22712 2018-08-18 07:59:59 69.049291 120.878188 7.962636 6.50 1.302419 1649.820162 -399.930973 1649.357538 -399.721222 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
22713 2018-08-18 08:59:59 67.002189 105.666118 7.955111 6.50 1.315926 1649.166761 -399.888631 1649.196904 -399.677571 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
22714 2018-08-18 09:59:59 65.523246 98.880538 7.984164 6.50 1.241969 1646.547763 -398.977083 1648.212240 -400.383265 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
22715 2018-08-18 10:59:59 70.281454 95.248427 8.078957 6.50 1.283045 1648.759906 -399.862053 1650.135395 -399.957321 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

20000 rows × 54 columns

In [71]:
# Compare Histograms (Training Set)
gold_train_new2[['rougher.input.floatbank10_sulfate','rougher.input.floatbank11_sulfate']].hist(figsize = [10,6])
plt.show()
No description has been provided for this image

Sulfate Difference Analysis: rougher.input.floatbank10_sulfate vs floatbank11_sulfate

sulfate_difference = rougher.input.floatbank10_sulfate - rougher.input.floatbank11_sulfate

Overall Dataset Statistics

Metric floatbank10_sulfate floatbank11_sulfate sulfate_difference
Length 14,853 14,853 14,853
Median 11.708 11.414 0.000085
Mean 11.763 11.389 0.360
Min 0.000044 0.000049 -12.978
Max 36.118 37.981 23.747

Categorical Breakdown by Sulfate11 Range and Difference Groups

Group Category Variable Median Mean Min Max
Sulfate11: 5.9-13.1 & Diff > 1 (14 observations)
floatbank10_sulfate 10.686 11.223 8.495 15.744
floatbank11_sulfate 8.314 8.476 6.321 11.343
sulfate_difference 2.303 2.748 1.003 7.280
Sulfate11: 5.9-13.1 & Diff 0-1 (4,945 observations)
floatbank10_sulfate 10.697 10.443 5.905 13.236
floatbank11_sulfate 10.687 10.435 5.905 13.098
sulfate_difference 0.002 0.008 0.000002 0.998
Sulfate11: 5.9-13.1 & Diff 0-(-1) (4,966 observations)
floatbank10_sulfate 10.629 10.436 5.914 13.095
floatbank11_sulfate 10.645 10.444 5.918 13.096
sulfate_difference -0.002 -0.007 -0.999 -0.000001
Sulfate11: 5.9-13.1 & Diff < -1 (28 observations)
floatbank10_sulfate 5.294 4.919 0.001 11.597
floatbank11_sulfate 8.681 9.475 6.435 13.004
sulfate_difference -3.120 -4.556 -12.978 -1.002
Sulfate11 ≤ 1 & Diff > 1 (402 observations)
floatbank10_sulfate 12.999 13.014 1.240 23.748
floatbank11_sulfate 0.029 0.028 0.000086 0.241
sulfate_difference 12.964 12.986 1.050 23.747
Sulfate11 ≤ 1 & Diff 0-1 (10 observations)
floatbank10_sulfate 0.043 0.307 0.001 1.352
floatbank11_sulfate 0.013 0.221 0.000049 0.961
sulfate_difference 0.023 0.086 0.0002 0.391
Sulfate11 ≤ 1 & Diff 0-(-1) (11 observations)
floatbank10_sulfate 0.009 0.210 0.002 0.676
floatbank11_sulfate 0.159 0.321 0.004 0.830
sulfate_difference -0.034 -0.111 -0.371 -0.001
Sulfate11 ≤ 1 & Diff < -1 (0 observations)
floatbank10_sulfate N/A N/A N/A N/A
floatbank11_sulfate N/A N/A N/A N/A
sulfate_difference N/A N/A N/A N/A

Key Observations

  • Near-equilibrium dominates: 9,911 out of 14,853 observations (66.7%) fall within the -1 to +1 difference range when sulfate11 is in range (5.9-13.1)
  • Low sulfate11 creates large positive differences: 402 observations with sulfate11 ≤ 1 show large positive differences (median: 12.964)
  • Extreme negative differences are rare: Only 28 observations show differences < -1 in the normal sulfate11 range (5.9 - 13.1)
  • Overall relationship is balanced: Median difference of 0.000085 indicates balance
  • Most data concentrated in range: The 5.9-13.1 sulfate11 range contains the majority of reliable data with small differences between sensors

Conclusion

Can safely input values from rougher.input.floatbank11_sulfate when in the range (5.9 - 13.1) into rougher.input.floatbank10_sulfate. The remaining 1.5% (222) missing values, from fb10_sulfate could potentially be from sensor errors and can be dropped to maintain balance.

In [72]:
# Look at the remaining NaN values
gold_train_new2.isna().sum()
Out[72]:
date                                            0
final.output.recovery                           0
primary_cleaner.input.sulfate                  49
primary_cleaner.input.depressant               61
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                103
primary_cleaner.state.floatbank8_a_air          0
primary_cleaner.state.floatbank8_a_level        0
primary_cleaner.state.floatbank8_b_air          0
primary_cleaner.state.floatbank8_b_level        0
primary_cleaner.state.floatbank8_c_air          0
primary_cleaner.state.floatbank8_c_level        0
primary_cleaner.state.floatbank8_d_air          0
primary_cleaner.state.floatbank8_d_level        0
rougher.input.feed_ag                           0
rougher.input.feed_pb                           0
rougher.input.feed_rate                        42
rougher.input.feed_size                         0
rougher.input.feed_sol                          0
rougher.input.feed_au                           0
rougher.input.floatbank10_sulfate               0
rougher.input.floatbank10_xanthate              0
rougher.input.floatbank11_sulfate              62
rougher.input.floatbank11_xanthate              0
rougher.state.floatbank10_a_air                 0
rougher.state.floatbank10_a_level               0
rougher.state.floatbank10_b_air                 0
rougher.state.floatbank10_b_level               0
rougher.state.floatbank10_c_air                 0
rougher.state.floatbank10_c_level               0
rougher.state.floatbank10_d_air                 0
rougher.state.floatbank10_d_level               0
rougher.state.floatbank10_e_air                 0
rougher.state.floatbank10_e_level               0
rougher.state.floatbank10_f_air                 0
rougher.state.floatbank10_f_level               0
secondary_cleaner.state.floatbank2_a_air      220
secondary_cleaner.state.floatbank2_a_level      0
secondary_cleaner.state.floatbank2_b_air        0
secondary_cleaner.state.floatbank2_b_level      0
secondary_cleaner.state.floatbank3_a_air        0
secondary_cleaner.state.floatbank3_a_level      0
secondary_cleaner.state.floatbank3_b_air        0
secondary_cleaner.state.floatbank3_b_level      0
secondary_cleaner.state.floatbank4_a_air        0
secondary_cleaner.state.floatbank4_a_level      0
secondary_cleaner.state.floatbank4_b_air        0
secondary_cleaner.state.floatbank4_b_level      0
secondary_cleaner.state.floatbank5_a_air        0
secondary_cleaner.state.floatbank5_a_level      0
secondary_cleaner.state.floatbank5_b_air        0
secondary_cleaner.state.floatbank5_b_level      0
secondary_cleaner.state.floatbank6_a_air        0
secondary_cleaner.state.floatbank6_a_level      0
dtype: int64
In [73]:
# Almost all are missing <1% data EXCEPT secondary_cleaner.state.floatbank2_a_air
# Drop the NaN values < 1%

# Training Set
drop_columns = ['rougher.input.floatbank11_sulfate','rougher.input.feed_rate','primary_cleaner.input.xanthate','primary_cleaner.input.depressant','primary_cleaner.input.sulfate']
gold_train_new2 = gold_train_new2.dropna(subset = drop_columns)
display(gold_train_new2)

# Full Set
gold_full_new2 = gold_full_new2.dropna(subset = drop_columns)
gold_full_new2
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 70.541216 127.092003 10.128295 7.25 0.988759 1549.775757 -498.912140 1551.434204 -516.403442 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 69.266198 125.629232 10.296251 7.25 1.002663 1576.166671 -500.904965 1575.950626 -499.865889 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 68.116445 123.819808 11.316280 7.25 0.991265 1601.556163 -499.997791 1600.386685 -500.607762 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 68.347543 122.270188 11.322140 7.25 0.996739 1599.968720 -500.951778 1600.659236 -499.677094 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 66.927016 117.988169 11.913613 7.25 1.009869 1601.339707 -498.975456 1601.437854 -500.323246 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16855 2018-08-18 06:59:59 73.755150 123.381787 8.028927 6.50 1.304232 1648.421193 -400.382169 1648.742005 -400.359661 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
16856 2018-08-18 07:59:59 69.049291 120.878188 7.962636 6.50 1.302419 1649.820162 -399.930973 1649.357538 -399.721222 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
16857 2018-08-18 08:59:59 67.002189 105.666118 7.955111 6.50 1.315926 1649.166761 -399.888631 1649.196904 -399.677571 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
16858 2018-08-18 09:59:59 65.523246 98.880538 7.984164 6.50 1.241969 1646.547763 -398.977083 1648.212240 -400.383265 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
16859 2018-08-18 10:59:59 70.281454 95.248427 8.078957 6.50 1.283045 1648.759906 -399.862053 1650.135395 -399.957321 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

14434 rows × 54 columns

Out[73]:
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 70.541216 127.092003 10.128295 7.25 0.988759 1549.775757 -498.912140 1551.434204 -516.403442 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 69.266198 125.629232 10.296251 7.25 1.002663 1576.166671 -500.904965 1575.950626 -499.865889 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 68.116445 123.819808 11.316280 7.25 0.991265 1601.556163 -499.997791 1600.386685 -500.607762 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 68.347543 122.270188 11.322140 7.25 0.996739 1599.968720 -500.951778 1600.659236 -499.677094 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 66.927016 117.988169 11.913613 7.25 1.009869 1601.339707 -498.975456 1601.437854 -500.323246 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22711 2018-08-18 06:59:59 73.755150 123.381787 8.028927 6.50 1.304232 1648.421193 -400.382169 1648.742005 -400.359661 ... 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
22712 2018-08-18 07:59:59 69.049291 120.878188 7.962636 6.50 1.302419 1649.820162 -399.930973 1649.357538 -399.721222 ... 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
22713 2018-08-18 08:59:59 67.002189 105.666118 7.955111 6.50 1.315926 1649.166761 -399.888631 1649.196904 -399.677571 ... 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
22714 2018-08-18 09:59:59 65.523246 98.880538 7.984164 6.50 1.241969 1646.547763 -398.977083 1648.212240 -400.383265 ... 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
22715 2018-08-18 10:59:59 70.281454 95.248427 8.078957 6.50 1.283045 1648.759906 -399.862053 1650.135395 -399.957321 ... 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

19788 rows × 54 columns

In [74]:
# Look into the last column `secondary_cleaner.state.floatbank2_a_air` with 1.5% missing data
gold_train_new2[gold_train_new2['secondary_cleaner.state.floatbank2_a_air'].isna()]

fb2 = gold_train_new2.filter(like = 'floatbank2', axis = 1).copy()
fb2 = fb2.filter(like = 'air', axis = 1)

fb2.hist(figsize = [10,5])
plt.show()

# Inspect the difference

fb2['a_b_difference'] = fb2['secondary_cleaner.state.floatbank2_a_air'] - fb2['secondary_cleaner.state.floatbank2_b_air']
fb2[(fb2['a_b_difference'] > -15) & (fb2['a_b_difference'] < 15)]

fb2[fb2['secondary_cleaner.state.floatbank2_b_air'] >= 35].max()

fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30) & (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35)].max()

fb2[(fb2['secondary_cleaner.state.floatbank2_a_air'].isna()) & fb2['secondary_cleaner.state.floatbank2_b_air']]

fb2[(fb2['secondary_cleaner.state.floatbank2_a_air'].isna())].max()
No description has been provided for this image
Out[74]:
secondary_cleaner.state.floatbank2_a_air          NaN
secondary_cleaner.state.floatbank2_b_air    35.038455
a_b_difference                                    NaN
dtype: float64
In [75]:
fb2
Out[75]:
secondary_cleaner.state.floatbank2_a_air secondary_cleaner.state.floatbank2_b_air a_b_difference
0 25.853109 23.893660 1.959450
1 25.880539 23.889530 1.991009
2 26.005245 23.886657 2.118588
3 25.942508 23.955516 1.986991
4 26.024787 23.955345 2.069442
... ... ... ...
16855 35.043205 29.906659 5.136546
16856 35.026062 29.921795 5.104267
16857 35.003586 29.990533 5.013053
16858 34.980742 29.968453 5.012288
16859 34.940919 30.031867 4.909052

14434 rows × 3 columns

In [76]:
# Only look at secondary_cleaner.state.floatbank2_b_air values between 22.9 & 35.1
fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 22.9) & (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1)]

print("DataFrame")
display(fb2)
print()
print("Median")
display(fb2.median())
print()
print("Mean")
display(fb2.mean())
print()
print("Min")
display(fb2.min())
print()
print("Max")
display(fb2.max())
print()
print("Length")
display(len(fb2))
print()
print()

print(f"fb_b between 22.9 - 35.1 & Differences Greater Than 5")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 5)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 5)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 5)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 5)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 5)]))
print()
print()

print(f"fb_b between 22.9 - 35.1 & Difference Between 1 - 5")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 1) & 
    (fb2['a_b_difference'] <= 5)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 1) & 
    (fb2['a_b_difference'] <= 5)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 1) & 
    (fb2['a_b_difference'] <= 5)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 1) & 
    (fb2['a_b_difference'] <= 5)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 1) & 
    (fb2['a_b_difference'] <= 5)]))
print()
print()

print(f"fb_b between 22.9 - 35.1 & Difference Between 0 - 1")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= 0) & 
    (fb2['a_b_difference'] <= 1)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= 0) & 
    (fb2['a_b_difference'] <= 1)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= 0) & 
    (fb2['a_b_difference'] <= 1)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= 0) & 
    (fb2['a_b_difference'] <= 1)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= 0) & 
    (fb2['a_b_difference'] <= 1)]))
print()
print()

print(f"fb_b between 22.9 - 35.1 & Difference Between 0 - (-)1")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < 0) & 
    (fb2['a_b_difference'] >= -1)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < 0) & 
    (fb2['a_b_difference'] >= -1)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < 0) & 
    (fb2['a_b_difference'] >= -1)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < 0) & 
    (fb2['a_b_difference'] >= -1)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < 0) & 
    (fb2['a_b_difference'] >= -1)]))
print()
print()

print(f"fb_b between 22.9 - 35.1 & Difference Between (-)1 - (-)5")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < -1) & 
    (fb2['a_b_difference'] >= -5)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < -1) & 
    (fb2['a_b_difference'] >= -5)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < -1) & 
    (fb2['a_b_difference'] >= -5)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < -1) & 
    (fb2['a_b_difference'] >= -5)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < -1) & 
    (fb2['a_b_difference'] >= -5)]))
print()
print()

print(f"fb_b between 22.9 - 35.1 & Differences Greater Than -5")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < -5)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < -5)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < -5)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < -5)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < -5)]))
print()
print("HISTOGRAM: B(22.9 - 35.1)")
fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 22.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= -15) & 
    (fb2['a_b_difference'] <= 25)].hist(figsize=[10,5])
plt.show()
DataFrame
secondary_cleaner.state.floatbank2_a_air secondary_cleaner.state.floatbank2_b_air a_b_difference
0 25.853109 23.893660 1.959450
1 25.880539 23.889530 1.991009
2 26.005245 23.886657 2.118588
3 25.942508 23.955516 1.986991
4 26.024787 23.955345 2.069442
... ... ... ...
16855 35.043205 29.906659 5.136546
16856 35.026062 29.921795 5.104267
16857 35.003586 29.990533 5.013053
16858 34.980742 29.968453 5.012288
16859 34.940919 30.031867 4.909052

14434 rows × 3 columns

Median
secondary_cleaner.state.floatbank2_a_air    30.023463
secondary_cleaner.state.floatbank2_b_air    27.021564
a_b_difference                               4.888594
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    29.654285
secondary_cleaner.state.floatbank2_b_air    25.029451
a_b_difference                               4.684958
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air     0.122013
secondary_cleaner.state.floatbank2_b_air     0.000000
a_b_difference                             -13.536875
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    52.651399
secondary_cleaner.state.floatbank2_b_air    35.152122
a_b_difference                              24.654966
dtype: float64
Length
14434

fb_b between 22.9 - 35.1 & Differences Greater Than 5
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    34.973171
secondary_cleaner.state.floatbank2_b_air    27.943509
a_b_difference                               6.975645
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    34.252227
secondary_cleaner.state.floatbank2_b_air    27.378102
a_b_difference                               6.874125
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    27.951685
secondary_cleaner.state.floatbank2_b_air    22.900440
a_b_difference                               5.000032
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    52.651399
secondary_cleaner.state.floatbank2_b_air    33.067334
a_b_difference                              24.654966
dtype: float64
Length
3717

fb_b between 22.9 - 35.1 & Difference Between 1 - 5
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    30.051628
secondary_cleaner.state.floatbank2_b_air    28.010539
a_b_difference                               2.122420
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    30.762827
secondary_cleaner.state.floatbank2_b_air    27.988947
a_b_difference                               2.773880
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    24.822521
secondary_cleaner.state.floatbank2_b_air    22.900248
a_b_difference                               1.000237
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    37.975584
secondary_cleaner.state.floatbank2_b_air    33.073671
a_b_difference                               4.999981
dtype: float64
Length
6293

fb_b between 22.9 - 35.1 & Difference Between 0 - 1
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    29.880223
secondary_cleaner.state.floatbank2_b_air    28.937471
a_b_difference                               0.944507
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    28.657667
secondary_cleaner.state.floatbank2_b_air    27.807391
a_b_difference                               0.850276
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    24.816333
secondary_cleaner.state.floatbank2_b_air    23.872494
a_b_difference                               0.000409
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    35.442058
secondary_cleaner.state.floatbank2_b_air    35.048410
a_b_difference                               0.999746
dtype: float64
Length
407

fb_b between 22.9 - 35.1 & Difference Between 0 - (-)1
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    31.988556
secondary_cleaner.state.floatbank2_b_air    32.108819
a_b_difference                              -0.099835
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    32.849083
secondary_cleaner.state.floatbank2_b_air    33.072617
a_b_difference                              -0.223534
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    26.737957
secondary_cleaner.state.floatbank2_b_air    26.978013
a_b_difference                              -0.989676
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    35.053086
secondary_cleaner.state.floatbank2_b_air    35.089583
a_b_difference                              -0.005577
dtype: float64
Length
57

fb_b between 22.9 - 35.1 & Difference Between (-)1 - (-)5
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    28.915830
secondary_cleaner.state.floatbank2_b_air    30.025430
a_b_difference                              -1.670835
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    28.713275
secondary_cleaner.state.floatbank2_b_air    30.507579
a_b_difference                              -1.794304
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    23.288626
secondary_cleaner.state.floatbank2_b_air    25.487310
a_b_difference                              -3.671505
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    33.596561
secondary_cleaner.state.floatbank2_b_air    35.062679
a_b_difference                              -1.004651
dtype: float64
Length
26

fb_b between 22.9 - 35.1 & Differences Greater Than -5
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    17.152828
secondary_cleaner.state.floatbank2_b_air    26.577827
a_b_difference                              -9.424999
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    17.152828
secondary_cleaner.state.floatbank2_b_air    26.577827
a_b_difference                              -9.424999
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    11.663794
secondary_cleaner.state.floatbank2_b_air    25.200668
a_b_difference                             -13.536875
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    22.641862
secondary_cleaner.state.floatbank2_b_air    27.954986
a_b_difference                              -5.313124
dtype: float64
Length
2
HISTOGRAM: B(22.9 - 35.1)
No description has been provided for this image
In [77]:
# Code used to find ranges within the main range (22.9 - 35.1)
with pd.option_context('display.max_columns',None):
    display(len(gold_train_new2[(gold_train_new2['secondary_cleaner.state.floatbank2_b_air'] >= 31.9) & 
        (gold_train_new2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & 
        (gold_train_new2['secondary_cleaner.state.floatbank2_a_air'].isna())]))
7
In [78]:
print("Grouped By fb_b in Increments")
print("-------------------------------------------")
print("B =  29.9 - 30.096: > 8")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] > 8)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] > 8)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] > 8)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] > 8)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] > 8)]))
print()
print()
print("B =  29.9 - 30.096: 2 - 8")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] >= 2) & 
    (fb2['a_b_difference'] <= 8)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] >= 2) & 
    (fb2['a_b_difference'] <= 8)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] >= 2) & 
    (fb2['a_b_difference'] <= 8)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] >= 2) & 
    (fb2['a_b_difference'] <= 8)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] >= 2) & 
    (fb2['a_b_difference'] <= 8)]))
print()
print()
print("B =  29.9 - 30.096: < 2")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] < 2)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] < 2)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] < 2)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] < 2)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] < 2)]))
print()
print("HISTOGRAM: B(29.9 - 30.096)")
fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 29.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.096) & (fb2['a_b_difference'] >= 0) & 
    (fb2['a_b_difference'] <= 15)].hist(figsize=[10,5])
plt.title("B(29.9 - 30.096)")
plt.show()
Grouped By fb_b in Increments
-------------------------------------------
B =  29.9 - 30.096: > 8
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    40.752771
secondary_cleaner.state.floatbank2_b_air    29.998159
a_b_difference                              10.773487
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    41.057590
secondary_cleaner.state.floatbank2_b_air    29.996569
a_b_difference                              11.061021
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    38.028825
secondary_cleaner.state.floatbank2_b_air    29.937648
a_b_difference                               8.022474
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    48.822781
secondary_cleaner.state.floatbank2_b_air    30.070166
a_b_difference                              18.830586
dtype: float64
Length
64

B =  29.9 - 30.096: 2 - 8
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    34.988451
secondary_cleaner.state.floatbank2_b_air    29.997182
a_b_difference                               4.991009
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    34.986608
secondary_cleaner.state.floatbank2_b_air    29.997998
a_b_difference                               4.988610
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    32.008632
secondary_cleaner.state.floatbank2_b_air    29.900325
a_b_difference                               2.001851
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    37.964324
secondary_cleaner.state.floatbank2_b_air    30.094222
a_b_difference                               7.976967
dtype: float64
Length
1759

B =  29.9 - 30.096: < 2
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    31.330761
secondary_cleaner.state.floatbank2_b_air    30.005149
a_b_difference                               1.363140
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    30.710582
secondary_cleaner.state.floatbank2_b_air    30.003108
a_b_difference                               0.707474
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    28.278860
secondary_cleaner.state.floatbank2_b_air    29.935529
a_b_difference                              -1.776456
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    32.019074
secondary_cleaner.state.floatbank2_b_air    30.065564
a_b_difference                               1.999661
dtype: float64
Length
26
HISTOGRAM: B(29.9 - 30.096)
No description has been provided for this image
In [79]:
print("B =  30.8 - 30.9: > 5")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] > 5)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] > 5)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] > 5)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] > 5)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] > 5)]))
print()
print()
print("B =  30.8 - 30.9: 1 - 5")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 5)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 5)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 5)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 5)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 5)]))
print()
print()
print("B =  30.8 - 30.9: < 1")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] < 1)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] < 1)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] < 1)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] < 1)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] >= 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 30.9) & (fb2['a_b_difference'] < 1)]))
print()
print("HISTOGRAM: B(30.8 - 30.9)")
fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 30.8) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 30.9) & (fb2['a_b_difference'] >= -50) & 
    (fb2['a_b_difference'] <= 50)].hist(figsize=[10,5])
plt.show()
B =  30.8 - 30.9: > 5
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Length
0

B =  30.8 - 30.9: 1 - 5
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    32.054830
secondary_cleaner.state.floatbank2_b_air    30.881083
a_b_difference                               1.170556
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    32.429425
secondary_cleaner.state.floatbank2_b_air    30.874578
a_b_difference                               1.554847
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    31.968334
secondary_cleaner.state.floatbank2_b_air    30.830393
a_b_difference                               1.088165
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    33.002983
secondary_cleaner.state.floatbank2_b_air    30.884599
a_b_difference                               2.164216
dtype: float64
Length
7

B =  30.8 - 30.9: < 1
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Length
0
HISTOGRAM: B(30.8 - 30.9)
No description has been provided for this image
In [80]:
print("B =  31.9 - 32.1: > 4")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] > 4)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] > 4)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] > 4)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] > 4)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] > 4)]))
print()
print()
print("B =  31.9 - 32.1: 1 - 4")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 4)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 4)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 4)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 4)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 4)]))
print()
print()
print("B =  31.9 - 32.1: < 1")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] < 1)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] < 1)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] < 1)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] < 1)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] <= 32.1) & (fb2['a_b_difference'] < 1)]))
print()
print("HISTOGRAM: B(31.9 - 32.1)")
fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 31.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 32.1) & (fb2['a_b_difference'] >= -2) & 
    (fb2['a_b_difference'] <= 5)].hist(figsize=[10,5])
plt.show()
B =  31.9 - 32.1: > 4
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    36.146185
secondary_cleaner.state.floatbank2_b_air    32.030408
a_b_difference                               4.115778
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    36.146185
secondary_cleaner.state.floatbank2_b_air    32.030408
a_b_difference                               4.115778
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    36.146185
secondary_cleaner.state.floatbank2_b_air    32.030408
a_b_difference                               4.115778
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    36.146185
secondary_cleaner.state.floatbank2_b_air    32.030408
a_b_difference                               4.115778
dtype: float64
Length
1

B =  31.9 - 32.1: 1 - 4
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    35.000732
secondary_cleaner.state.floatbank2_b_air    32.010188
a_b_difference                               2.991053
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    34.903813
secondary_cleaner.state.floatbank2_b_air    32.006914
a_b_difference                               2.896899
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    33.557849
secondary_cleaner.state.floatbank2_b_air    31.900684
a_b_difference                               1.606131
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    35.381552
secondary_cleaner.state.floatbank2_b_air    32.099894
a_b_difference                               3.453636
dtype: float64
Length
509

B =  31.9 - 32.1: < 1
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    31.965927
secondary_cleaner.state.floatbank2_b_air    32.012205
a_b_difference                              -0.049059
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    31.899670
secondary_cleaner.state.floatbank2_b_air    32.004511
a_b_difference                              -0.104841
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    29.985472
secondary_cleaner.state.floatbank2_b_air    31.905816
a_b_difference                              -1.968312
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    32.600791
secondary_cleaner.state.floatbank2_b_air    32.096872
a_b_difference                               0.584643
dtype: float64
Length
32
HISTOGRAM: B(31.9 - 32.1)
No description has been provided for this image
In [81]:
print("B =  34.9 - 35.1: > 5")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 5)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 5)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 5)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 5)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] > 5)]))
print()
print()
print("B =  34.9 - 35.1: 1 - 5")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 5)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 5)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 5)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 5)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= 1) & 
    (fb2['a_b_difference'] <= 5)]))
print()
print()
print("B =  34.9 - 35.1: < 1")
print("-----------------------------------")
print("Median")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < 1)].median())
print()
print("Mean")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < 1)].mean())
print()
print("Min")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < 1)].min())
print()
print("Max")
display(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < 1)].max())
print()
print("Length")
display(len(fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] < 1)]))
print()
print("HISTOGRAM: B(34.9 - 35.1)")
fb2[(fb2['secondary_cleaner.state.floatbank2_b_air'] > 34.9) & 
    (fb2['secondary_cleaner.state.floatbank2_b_air'] < 35.1) & (fb2['a_b_difference'] >= -2) & 
    (fb2['a_b_difference'] <= 1)].hist(figsize=[10,5])
plt.title("B(34.9 - 35.1)")
plt.show()
B =  34.9 - 35.1: > 5
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Length
0

B =  34.9 - 35.1: 1 - 5
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air   NaN
secondary_cleaner.state.floatbank2_b_air   NaN
a_b_difference                             NaN
dtype: float64
Length
0

B =  34.9 - 35.1: < 1
-----------------------------------
Median
secondary_cleaner.state.floatbank2_a_air    34.993688
secondary_cleaner.state.floatbank2_b_air    35.010702
a_b_difference                              -0.021907
dtype: float64
Mean
secondary_cleaner.state.floatbank2_a_air    34.782362
secondary_cleaner.state.floatbank2_b_air    35.010515
a_b_difference                              -0.228153
dtype: float64
Min
secondary_cleaner.state.floatbank2_a_air    32.756136
secondary_cleaner.state.floatbank2_b_air    34.909409
a_b_difference                              -2.290366
dtype: float64
Max
secondary_cleaner.state.floatbank2_a_air    35.442058
secondary_cleaner.state.floatbank2_b_air    35.089583
a_b_difference                               0.393647
dtype: float64
Length
50
HISTOGRAM: B(34.9 - 35.1)
No description has been provided for this image
In [82]:
gold_train_new2.isna().sum()
Out[82]:
date                                            0
final.output.recovery                           0
primary_cleaner.input.sulfate                   0
primary_cleaner.input.depressant                0
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                  0
primary_cleaner.state.floatbank8_a_air          0
primary_cleaner.state.floatbank8_a_level        0
primary_cleaner.state.floatbank8_b_air          0
primary_cleaner.state.floatbank8_b_level        0
primary_cleaner.state.floatbank8_c_air          0
primary_cleaner.state.floatbank8_c_level        0
primary_cleaner.state.floatbank8_d_air          0
primary_cleaner.state.floatbank8_d_level        0
rougher.input.feed_ag                           0
rougher.input.feed_pb                           0
rougher.input.feed_rate                         0
rougher.input.feed_size                         0
rougher.input.feed_sol                          0
rougher.input.feed_au                           0
rougher.input.floatbank10_sulfate               0
rougher.input.floatbank10_xanthate              0
rougher.input.floatbank11_sulfate               0
rougher.input.floatbank11_xanthate              0
rougher.state.floatbank10_a_air                 0
rougher.state.floatbank10_a_level               0
rougher.state.floatbank10_b_air                 0
rougher.state.floatbank10_b_level               0
rougher.state.floatbank10_c_air                 0
rougher.state.floatbank10_c_level               0
rougher.state.floatbank10_d_air                 0
rougher.state.floatbank10_d_level               0
rougher.state.floatbank10_e_air                 0
rougher.state.floatbank10_e_level               0
rougher.state.floatbank10_f_air                 0
rougher.state.floatbank10_f_level               0
secondary_cleaner.state.floatbank2_a_air      217
secondary_cleaner.state.floatbank2_a_level      0
secondary_cleaner.state.floatbank2_b_air        0
secondary_cleaner.state.floatbank2_b_level      0
secondary_cleaner.state.floatbank3_a_air        0
secondary_cleaner.state.floatbank3_a_level      0
secondary_cleaner.state.floatbank3_b_air        0
secondary_cleaner.state.floatbank3_b_level      0
secondary_cleaner.state.floatbank4_a_air        0
secondary_cleaner.state.floatbank4_a_level      0
secondary_cleaner.state.floatbank4_b_air        0
secondary_cleaner.state.floatbank4_b_level      0
secondary_cleaner.state.floatbank5_a_air        0
secondary_cleaner.state.floatbank5_a_level      0
secondary_cleaner.state.floatbank5_b_air        0
secondary_cleaner.state.floatbank5_b_level      0
secondary_cleaner.state.floatbank6_a_air        0
secondary_cleaner.state.floatbank6_a_level      0
dtype: int64
In [83]:
# Define a function to replace the specified values
def replace_values(df,start,stop, add):
    replace_a = df[(df['secondary_cleaner.state.floatbank2_b_air'] >= start) &
        (df['secondary_cleaner.state.floatbank2_b_air'] <= stop) &
        (df['secondary_cleaner.state.floatbank2_a_air'].isna())]
    
    replace_a_index = replace_a.index
    replace_a_values = replace_a['secondary_cleaner.state.floatbank2_b_air'] + add

    df.loc[replace_a_index,['secondary_cleaner.state.floatbank2_a_air']] = replace_a_values
In [84]:
display(gold_train_new2.isna().sum())
gold_full_new2.isna().sum()
date                                            0
final.output.recovery                           0
primary_cleaner.input.sulfate                   0
primary_cleaner.input.depressant                0
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                  0
primary_cleaner.state.floatbank8_a_air          0
primary_cleaner.state.floatbank8_a_level        0
primary_cleaner.state.floatbank8_b_air          0
primary_cleaner.state.floatbank8_b_level        0
primary_cleaner.state.floatbank8_c_air          0
primary_cleaner.state.floatbank8_c_level        0
primary_cleaner.state.floatbank8_d_air          0
primary_cleaner.state.floatbank8_d_level        0
rougher.input.feed_ag                           0
rougher.input.feed_pb                           0
rougher.input.feed_rate                         0
rougher.input.feed_size                         0
rougher.input.feed_sol                          0
rougher.input.feed_au                           0
rougher.input.floatbank10_sulfate               0
rougher.input.floatbank10_xanthate              0
rougher.input.floatbank11_sulfate               0
rougher.input.floatbank11_xanthate              0
rougher.state.floatbank10_a_air                 0
rougher.state.floatbank10_a_level               0
rougher.state.floatbank10_b_air                 0
rougher.state.floatbank10_b_level               0
rougher.state.floatbank10_c_air                 0
rougher.state.floatbank10_c_level               0
rougher.state.floatbank10_d_air                 0
rougher.state.floatbank10_d_level               0
rougher.state.floatbank10_e_air                 0
rougher.state.floatbank10_e_level               0
rougher.state.floatbank10_f_air                 0
rougher.state.floatbank10_f_level               0
secondary_cleaner.state.floatbank2_a_air      217
secondary_cleaner.state.floatbank2_a_level      0
secondary_cleaner.state.floatbank2_b_air        0
secondary_cleaner.state.floatbank2_b_level      0
secondary_cleaner.state.floatbank3_a_air        0
secondary_cleaner.state.floatbank3_a_level      0
secondary_cleaner.state.floatbank3_b_air        0
secondary_cleaner.state.floatbank3_b_level      0
secondary_cleaner.state.floatbank4_a_air        0
secondary_cleaner.state.floatbank4_a_level      0
secondary_cleaner.state.floatbank4_b_air        0
secondary_cleaner.state.floatbank4_b_level      0
secondary_cleaner.state.floatbank5_a_air        0
secondary_cleaner.state.floatbank5_a_level      0
secondary_cleaner.state.floatbank5_b_air        0
secondary_cleaner.state.floatbank5_b_level      0
secondary_cleaner.state.floatbank6_a_air        0
secondary_cleaner.state.floatbank6_a_level      0
dtype: int64
Out[84]:
date                                            0
final.output.recovery                           0
primary_cleaner.input.sulfate                   0
primary_cleaner.input.depressant                0
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                  0
primary_cleaner.state.floatbank8_a_air          0
primary_cleaner.state.floatbank8_a_level        0
primary_cleaner.state.floatbank8_b_air          0
primary_cleaner.state.floatbank8_b_level        0
primary_cleaner.state.floatbank8_c_air          0
primary_cleaner.state.floatbank8_c_level        0
primary_cleaner.state.floatbank8_d_air          0
primary_cleaner.state.floatbank8_d_level        0
rougher.input.feed_ag                           0
rougher.input.feed_pb                           0
rougher.input.feed_rate                         0
rougher.input.feed_size                         0
rougher.input.feed_sol                          0
rougher.input.feed_au                           0
rougher.input.floatbank10_sulfate               0
rougher.input.floatbank10_xanthate              0
rougher.input.floatbank11_sulfate               0
rougher.input.floatbank11_xanthate              0
rougher.state.floatbank10_a_air                 0
rougher.state.floatbank10_a_level               0
rougher.state.floatbank10_b_air                 0
rougher.state.floatbank10_b_level               0
rougher.state.floatbank10_c_air                 0
rougher.state.floatbank10_c_level               0
rougher.state.floatbank10_d_air                 0
rougher.state.floatbank10_d_level               0
rougher.state.floatbank10_e_air                 0
rougher.state.floatbank10_e_level               0
rougher.state.floatbank10_f_air                 0
rougher.state.floatbank10_f_level               0
secondary_cleaner.state.floatbank2_a_air      220
secondary_cleaner.state.floatbank2_a_level      0
secondary_cleaner.state.floatbank2_b_air        0
secondary_cleaner.state.floatbank2_b_level      0
secondary_cleaner.state.floatbank3_a_air        0
secondary_cleaner.state.floatbank3_a_level      0
secondary_cleaner.state.floatbank3_b_air        0
secondary_cleaner.state.floatbank3_b_level      0
secondary_cleaner.state.floatbank4_a_air        0
secondary_cleaner.state.floatbank4_a_level      0
secondary_cleaner.state.floatbank4_b_air        0
secondary_cleaner.state.floatbank4_b_level      0
secondary_cleaner.state.floatbank5_a_air        0
secondary_cleaner.state.floatbank5_a_level      0
secondary_cleaner.state.floatbank5_b_air        0
secondary_cleaner.state.floatbank5_b_level      0
secondary_cleaner.state.floatbank6_a_air        0
secondary_cleaner.state.floatbank6_a_level      0
dtype: int64
In [85]:
# When secondary_cleaner.state.floatbank2_b_air is between 29.9 - 30.096
# ~95.1% (1759) of secondary_cleaner.state.floatbank2_a_air is (~)+5.0
# The range for the gap was between 2 - 8 w/ roughly 1500 between a gap of 3-6

# Training dataset
gold_train_new2 = gold_train_new2.copy()
replace_values(gold_train_new2,29.9, 30.096, 5)

# Full dataset
gold_full_new2 = gold_full_new2.copy()
replace_values(gold_full_new2,29.9, 30.096, 5)

# Test dataset
gold_test_new2 = gold_test_new2.copy()
replace_values(gold_test_new2,29.9, 30.096, 5)
In [86]:
# When secondary_cleaner.state.floatbank2_b_air is between 30.8 - 30.9
# ~100% (7) of secondary_cleaner.state.floatbank2_a_air is (~)+1.2
# The range for the gap was between 1 - 2.2

# Training dataset
replace_values(gold_train_new2,30.8, 30.9, 1.2)

# Full dataset
replace_values(gold_full_new2,30.8, 30.9, 1.2)

# Test dataset
replace_values(gold_test_new2,30.8, 30.9, 1.2)
In [87]:
# When secondary_cleaner.state.floatbank2_b_air is between 31.9 - 32.1
# ~93.9% (509) of secondary_cleaner.state.floatbank2_a_air is (~)+3.0
# The range for the gap was between 1 - 4 

# Training dataset
replace_values(gold_train_new2,31.9, 32.1, 3)

# Full dataset
replace_values(gold_full_new2,31.9, 32.1, 3)

# Test dataset
replace_values(gold_test_new2,31.9, 32.1, 3)
In [88]:
# When secondary_cleaner.state.floatbank2_b_air is between 34.9 - 35.1
# ~100% (50) of secondary_cleaner.state.floatbank2_a_air is (~)-0.0
# The range for the gap was between < 1 with a min gap of -2.29 and a max gap of 0.39

# Training dataset
replace_values(gold_train_new2,34.9, 35.1, -0)

# Full dataset
replace_values(gold_full_new2,34.9, 35.1, -0)

# test dataset
replace_values(gold_test_new2,34.9, 35.1, -0)
In [89]:
gold_test_new2.isna().sum()
Out[89]:
date                                            0
primary_cleaner.input.sulfate                 302
primary_cleaner.input.depressant              284
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                166
primary_cleaner.state.floatbank8_a_air         16
primary_cleaner.state.floatbank8_a_level       16
primary_cleaner.state.floatbank8_b_air         16
primary_cleaner.state.floatbank8_b_level       16
primary_cleaner.state.floatbank8_c_air         16
primary_cleaner.state.floatbank8_c_level       16
primary_cleaner.state.floatbank8_d_air         16
primary_cleaner.state.floatbank8_d_level       16
rougher.input.feed_ag                          16
rougher.input.feed_pb                          16
rougher.input.feed_rate                        40
rougher.input.feed_size                        22
rougher.input.feed_sol                         67
rougher.input.feed_au                          16
rougher.input.floatbank10_sulfate             254
rougher.input.floatbank10_xanthate            123
rougher.input.floatbank11_sulfate              55
rougher.input.floatbank11_xanthate            116
rougher.state.floatbank10_a_air                17
rougher.state.floatbank10_a_level              16
rougher.state.floatbank10_b_air                17
rougher.state.floatbank10_b_level              16
rougher.state.floatbank10_c_air                17
rougher.state.floatbank10_c_level              16
rougher.state.floatbank10_d_air                17
rougher.state.floatbank10_d_level              16
rougher.state.floatbank10_e_air                17
rougher.state.floatbank10_e_level              16
rougher.state.floatbank10_f_air                17
rougher.state.floatbank10_f_level              16
secondary_cleaner.state.floatbank2_a_air       20
secondary_cleaner.state.floatbank2_a_level     16
secondary_cleaner.state.floatbank2_b_air       23
secondary_cleaner.state.floatbank2_b_level     16
secondary_cleaner.state.floatbank3_a_air       34
secondary_cleaner.state.floatbank3_a_level     16
secondary_cleaner.state.floatbank3_b_air       16
secondary_cleaner.state.floatbank3_b_level     16
secondary_cleaner.state.floatbank4_a_air       16
secondary_cleaner.state.floatbank4_a_level     16
secondary_cleaner.state.floatbank4_b_air       16
secondary_cleaner.state.floatbank4_b_level     16
secondary_cleaner.state.floatbank5_a_air       16
secondary_cleaner.state.floatbank5_a_level     16
secondary_cleaner.state.floatbank5_b_air       16
secondary_cleaner.state.floatbank5_b_level     16
secondary_cleaner.state.floatbank6_a_air       16
secondary_cleaner.state.floatbank6_a_level     16
dtype: int64
In [90]:
# Remaining NaN values in secondary_cleaner.state.floatbank2_a_air 

display(gold_train_new2.isna().sum())
display(gold_full_new2.isna().sum())
date                                           0
final.output.recovery                          0
primary_cleaner.input.sulfate                  0
primary_cleaner.input.depressant               0
primary_cleaner.input.feed_size                0
primary_cleaner.input.xanthate                 0
primary_cleaner.state.floatbank8_a_air         0
primary_cleaner.state.floatbank8_a_level       0
primary_cleaner.state.floatbank8_b_air         0
primary_cleaner.state.floatbank8_b_level       0
primary_cleaner.state.floatbank8_c_air         0
primary_cleaner.state.floatbank8_c_level       0
primary_cleaner.state.floatbank8_d_air         0
primary_cleaner.state.floatbank8_d_level       0
rougher.input.feed_ag                          0
rougher.input.feed_pb                          0
rougher.input.feed_rate                        0
rougher.input.feed_size                        0
rougher.input.feed_sol                         0
rougher.input.feed_au                          0
rougher.input.floatbank10_sulfate              0
rougher.input.floatbank10_xanthate             0
rougher.input.floatbank11_sulfate              0
rougher.input.floatbank11_xanthate             0
rougher.state.floatbank10_a_air                0
rougher.state.floatbank10_a_level              0
rougher.state.floatbank10_b_air                0
rougher.state.floatbank10_b_level              0
rougher.state.floatbank10_c_air                0
rougher.state.floatbank10_c_level              0
rougher.state.floatbank10_d_air                0
rougher.state.floatbank10_d_level              0
rougher.state.floatbank10_e_air                0
rougher.state.floatbank10_e_level              0
rougher.state.floatbank10_f_air                0
rougher.state.floatbank10_f_level              0
secondary_cleaner.state.floatbank2_a_air      98
secondary_cleaner.state.floatbank2_a_level     0
secondary_cleaner.state.floatbank2_b_air       0
secondary_cleaner.state.floatbank2_b_level     0
secondary_cleaner.state.floatbank3_a_air       0
secondary_cleaner.state.floatbank3_a_level     0
secondary_cleaner.state.floatbank3_b_air       0
secondary_cleaner.state.floatbank3_b_level     0
secondary_cleaner.state.floatbank4_a_air       0
secondary_cleaner.state.floatbank4_a_level     0
secondary_cleaner.state.floatbank4_b_air       0
secondary_cleaner.state.floatbank4_b_level     0
secondary_cleaner.state.floatbank5_a_air       0
secondary_cleaner.state.floatbank5_a_level     0
secondary_cleaner.state.floatbank5_b_air       0
secondary_cleaner.state.floatbank5_b_level     0
secondary_cleaner.state.floatbank6_a_air       0
secondary_cleaner.state.floatbank6_a_level     0
dtype: int64
date                                            0
final.output.recovery                           0
primary_cleaner.input.sulfate                   0
primary_cleaner.input.depressant                0
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                  0
primary_cleaner.state.floatbank8_a_air          0
primary_cleaner.state.floatbank8_a_level        0
primary_cleaner.state.floatbank8_b_air          0
primary_cleaner.state.floatbank8_b_level        0
primary_cleaner.state.floatbank8_c_air          0
primary_cleaner.state.floatbank8_c_level        0
primary_cleaner.state.floatbank8_d_air          0
primary_cleaner.state.floatbank8_d_level        0
rougher.input.feed_ag                           0
rougher.input.feed_pb                           0
rougher.input.feed_rate                         0
rougher.input.feed_size                         0
rougher.input.feed_sol                          0
rougher.input.feed_au                           0
rougher.input.floatbank10_sulfate               0
rougher.input.floatbank10_xanthate              0
rougher.input.floatbank11_sulfate               0
rougher.input.floatbank11_xanthate              0
rougher.state.floatbank10_a_air                 0
rougher.state.floatbank10_a_level               0
rougher.state.floatbank10_b_air                 0
rougher.state.floatbank10_b_level               0
rougher.state.floatbank10_c_air                 0
rougher.state.floatbank10_c_level               0
rougher.state.floatbank10_d_air                 0
rougher.state.floatbank10_d_level               0
rougher.state.floatbank10_e_air                 0
rougher.state.floatbank10_e_level               0
rougher.state.floatbank10_f_air                 0
rougher.state.floatbank10_f_level               0
secondary_cleaner.state.floatbank2_a_air      101
secondary_cleaner.state.floatbank2_a_level      0
secondary_cleaner.state.floatbank2_b_air        0
secondary_cleaner.state.floatbank2_b_level      0
secondary_cleaner.state.floatbank3_a_air        0
secondary_cleaner.state.floatbank3_a_level      0
secondary_cleaner.state.floatbank3_b_air        0
secondary_cleaner.state.floatbank3_b_level      0
secondary_cleaner.state.floatbank4_a_air        0
secondary_cleaner.state.floatbank4_a_level      0
secondary_cleaner.state.floatbank4_b_air        0
secondary_cleaner.state.floatbank4_b_level      0
secondary_cleaner.state.floatbank5_a_air        0
secondary_cleaner.state.floatbank5_a_level      0
secondary_cleaner.state.floatbank5_b_air        0
secondary_cleaner.state.floatbank5_b_level      0
secondary_cleaner.state.floatbank6_a_air        0
secondary_cleaner.state.floatbank6_a_level      0
dtype: int64
In [91]:
# Drop the rest of the NaN values as they are under 1% (98/14,434) missing

# Training Set
gold_train_new2 = gold_train_new2.dropna()

# Full Dataset
gold_full_new2 = gold_train_new2.dropna()

display(gold_train_new2.isna().sum())
gold_full_new2.isna().sum()
date                                          0
final.output.recovery                         0
primary_cleaner.input.sulfate                 0
primary_cleaner.input.depressant              0
primary_cleaner.input.feed_size               0
primary_cleaner.input.xanthate                0
primary_cleaner.state.floatbank8_a_air        0
primary_cleaner.state.floatbank8_a_level      0
primary_cleaner.state.floatbank8_b_air        0
primary_cleaner.state.floatbank8_b_level      0
primary_cleaner.state.floatbank8_c_air        0
primary_cleaner.state.floatbank8_c_level      0
primary_cleaner.state.floatbank8_d_air        0
primary_cleaner.state.floatbank8_d_level      0
rougher.input.feed_ag                         0
rougher.input.feed_pb                         0
rougher.input.feed_rate                       0
rougher.input.feed_size                       0
rougher.input.feed_sol                        0
rougher.input.feed_au                         0
rougher.input.floatbank10_sulfate             0
rougher.input.floatbank10_xanthate            0
rougher.input.floatbank11_sulfate             0
rougher.input.floatbank11_xanthate            0
rougher.state.floatbank10_a_air               0
rougher.state.floatbank10_a_level             0
rougher.state.floatbank10_b_air               0
rougher.state.floatbank10_b_level             0
rougher.state.floatbank10_c_air               0
rougher.state.floatbank10_c_level             0
rougher.state.floatbank10_d_air               0
rougher.state.floatbank10_d_level             0
rougher.state.floatbank10_e_air               0
rougher.state.floatbank10_e_level             0
rougher.state.floatbank10_f_air               0
rougher.state.floatbank10_f_level             0
secondary_cleaner.state.floatbank2_a_air      0
secondary_cleaner.state.floatbank2_a_level    0
secondary_cleaner.state.floatbank2_b_air      0
secondary_cleaner.state.floatbank2_b_level    0
secondary_cleaner.state.floatbank3_a_air      0
secondary_cleaner.state.floatbank3_a_level    0
secondary_cleaner.state.floatbank3_b_air      0
secondary_cleaner.state.floatbank3_b_level    0
secondary_cleaner.state.floatbank4_a_air      0
secondary_cleaner.state.floatbank4_a_level    0
secondary_cleaner.state.floatbank4_b_air      0
secondary_cleaner.state.floatbank4_b_level    0
secondary_cleaner.state.floatbank5_a_air      0
secondary_cleaner.state.floatbank5_a_level    0
secondary_cleaner.state.floatbank5_b_air      0
secondary_cleaner.state.floatbank5_b_level    0
secondary_cleaner.state.floatbank6_a_air      0
secondary_cleaner.state.floatbank6_a_level    0
dtype: int64
Out[91]:
date                                          0
final.output.recovery                         0
primary_cleaner.input.sulfate                 0
primary_cleaner.input.depressant              0
primary_cleaner.input.feed_size               0
primary_cleaner.input.xanthate                0
primary_cleaner.state.floatbank8_a_air        0
primary_cleaner.state.floatbank8_a_level      0
primary_cleaner.state.floatbank8_b_air        0
primary_cleaner.state.floatbank8_b_level      0
primary_cleaner.state.floatbank8_c_air        0
primary_cleaner.state.floatbank8_c_level      0
primary_cleaner.state.floatbank8_d_air        0
primary_cleaner.state.floatbank8_d_level      0
rougher.input.feed_ag                         0
rougher.input.feed_pb                         0
rougher.input.feed_rate                       0
rougher.input.feed_size                       0
rougher.input.feed_sol                        0
rougher.input.feed_au                         0
rougher.input.floatbank10_sulfate             0
rougher.input.floatbank10_xanthate            0
rougher.input.floatbank11_sulfate             0
rougher.input.floatbank11_xanthate            0
rougher.state.floatbank10_a_air               0
rougher.state.floatbank10_a_level             0
rougher.state.floatbank10_b_air               0
rougher.state.floatbank10_b_level             0
rougher.state.floatbank10_c_air               0
rougher.state.floatbank10_c_level             0
rougher.state.floatbank10_d_air               0
rougher.state.floatbank10_d_level             0
rougher.state.floatbank10_e_air               0
rougher.state.floatbank10_e_level             0
rougher.state.floatbank10_f_air               0
rougher.state.floatbank10_f_level             0
secondary_cleaner.state.floatbank2_a_air      0
secondary_cleaner.state.floatbank2_a_level    0
secondary_cleaner.state.floatbank2_b_air      0
secondary_cleaner.state.floatbank2_b_level    0
secondary_cleaner.state.floatbank3_a_air      0
secondary_cleaner.state.floatbank3_a_level    0
secondary_cleaner.state.floatbank3_b_air      0
secondary_cleaner.state.floatbank3_b_level    0
secondary_cleaner.state.floatbank4_a_air      0
secondary_cleaner.state.floatbank4_a_level    0
secondary_cleaner.state.floatbank4_b_air      0
secondary_cleaner.state.floatbank4_b_level    0
secondary_cleaner.state.floatbank5_a_air      0
secondary_cleaner.state.floatbank5_a_level    0
secondary_cleaner.state.floatbank5_b_air      0
secondary_cleaner.state.floatbank5_b_level    0
secondary_cleaner.state.floatbank6_a_air      0
secondary_cleaner.state.floatbank6_a_level    0
dtype: int64

Missing Values - Impute Test Set¶

In [92]:
# Test Sets Top Missing Values
def missing_pct(df):
    missing_values = df.isna().sum().values
    column_names = df.isna().sum().index
    return list((missing_values / 5856) * 100)

values = missing_pct(gold_test_new2)

values[:23]
Out[92]:
[0.0,
 5.157103825136613,
 4.849726775956284,
 0.0,
 2.8346994535519126,
 0.273224043715847,
 0.273224043715847,
 0.273224043715847,
 0.273224043715847,
 0.273224043715847,
 0.273224043715847,
 0.273224043715847,
 0.273224043715847,
 0.273224043715847,
 0.273224043715847,
 0.6830601092896175,
 0.3756830601092896,
 1.1441256830601094,
 0.273224043715847,
 4.337431693989071,
 2.1004098360655736,
 0.9392076502732241,
 1.9808743169398908]

Missing Values in Test Set >1%

Column Count Percent

primary_cleaner.input.sulfate | 302 | 5.16% primary_cleaner.input.depressant | 284 | 4.85% rougher.input.floatbank10_sulfate | 254 | 4.34% primary_cleaner.input.xanthate | 166 | 2.83% rougher.input.floatbank10_xanthate | 123 | 2.10% rougher.input.floatbank11_xanthate | 116 | 1.98% rougher.input.feed_sol | 67 | 1.14%

In [93]:
# Use Training data to impute Test Set

test_imp_columns = ['primary_cleaner.input.sulfate','primary_cleaner.input.depressant',
                    'rougher.input.floatbank10_sulfate','primary_cleaner.input.xanthate',
                    'rougher.input.floatbank10_xanthate','rougher.input.floatbank11_xanthate',
                    'rougher.input.feed_sol']

# Investigate the histograms from the Training dataset for the test_imp_columns
gold_train_new2[test_imp_columns].hist(figsize = [12,8])
Out[93]:
array([[<AxesSubplot:title={'center':'primary_cleaner.input.sulfate'}>,
        <AxesSubplot:title={'center':'primary_cleaner.input.depressant'}>,
        <AxesSubplot:title={'center':'rougher.input.floatbank10_sulfate'}>],
       [<AxesSubplot:title={'center':'primary_cleaner.input.xanthate'}>,
        <AxesSubplot:title={'center':'rougher.input.floatbank10_xanthate'}>,
        <AxesSubplot:title={'center':'rougher.input.floatbank11_xanthate'}>],
       [<AxesSubplot:title={'center':'rougher.input.feed_sol'}>,
        <AxesSubplot:>, <AxesSubplot:>]], dtype=object)
No description has been provided for this image
In [94]:
with pd.option_context('display.max_columns',None):
    display(gold_train_new2)
date final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level primary_cleaner.state.floatbank8_c_air primary_cleaner.state.floatbank8_c_level primary_cleaner.state.floatbank8_d_air primary_cleaner.state.floatbank8_d_level rougher.input.feed_ag rougher.input.feed_pb rougher.input.feed_rate rougher.input.feed_size rougher.input.feed_sol rougher.input.feed_au rougher.input.floatbank10_sulfate rougher.input.floatbank10_xanthate rougher.input.floatbank11_sulfate rougher.input.floatbank11_xanthate rougher.state.floatbank10_a_air rougher.state.floatbank10_a_level rougher.state.floatbank10_b_air rougher.state.floatbank10_b_level rougher.state.floatbank10_c_air rougher.state.floatbank10_c_level rougher.state.floatbank10_d_air rougher.state.floatbank10_d_level rougher.state.floatbank10_e_air rougher.state.floatbank10_e_level rougher.state.floatbank10_f_air rougher.state.floatbank10_f_level secondary_cleaner.state.floatbank2_a_air secondary_cleaner.state.floatbank2_a_level secondary_cleaner.state.floatbank2_b_air secondary_cleaner.state.floatbank2_b_level secondary_cleaner.state.floatbank3_a_air secondary_cleaner.state.floatbank3_a_level secondary_cleaner.state.floatbank3_b_air secondary_cleaner.state.floatbank3_b_level secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 70.541216 127.092003 10.128295 7.25 0.988759 1549.775757 -498.912140 1551.434204 -516.403442 1549.873901 -498.666595 1554.367432 -493.428131 6.100378 2.284912 523.546326 55.486599 36.808594 6.486150 11.986616 6.007990 11.836743 6.005818 999.706909 -404.066986 1603.011353 -434.715027 1602.375000 -442.204468 1598.937256 -451.294128 1404.472046 -455.462982 1416.354980 -451.939636 25.853109 -498.526489 23.893660 -501.406281 23.961798 -495.262817 21.940409 -499.340973 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 69.266198 125.629232 10.296251 7.25 1.002663 1576.166671 -500.904965 1575.950626 -499.865889 1575.994189 -499.315107 1574.479259 -498.931665 6.161113 2.266033 525.290581 57.278666 35.753385 6.478583 11.971193 6.005766 11.996163 6.012594 1000.286398 -400.065196 1600.754587 -449.953435 1600.479580 -449.830646 1600.527589 -449.953649 1399.227084 -450.869848 1399.719514 -450.119001 25.880539 -499.989656 23.889530 -500.372428 23.970550 -500.085473 22.085714 -499.446897 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 68.116445 123.819808 11.316280 7.25 0.991265 1601.556163 -499.997791 1600.386685 -500.607762 1602.003542 -500.870069 1599.541515 -499.827444 6.116455 2.159622 530.026610 57.510649 35.971630 6.362222 11.920603 6.197377 11.920305 6.204633 999.719565 -400.074028 1599.337330 -450.008530 1599.672797 -449.954491 1599.849325 -449.954185 1399.180945 -449.937588 1400.316682 -450.527147 26.005245 -499.929616 23.886657 -499.951928 23.913535 -499.442343 23.957717 -499.901982 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 68.347543 122.270188 11.322140 7.25 0.996739 1599.968720 -500.951778 1600.659236 -499.677094 1600.304144 -500.727997 1600.449520 -500.052575 6.043309 2.037807 542.590390 57.792734 36.862241 6.118189 11.630094 6.203177 11.692450 6.196578 999.814770 -400.200179 1600.059442 -450.619948 1600.012842 -449.910497 1597.725177 -450.130127 1400.943157 -450.030142 1400.234743 -449.790835 25.942508 -499.176749 23.955516 -499.848796 23.966838 -500.008812 23.954443 -499.944710 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 66.927016 117.988169 11.913613 7.25 1.009869 1601.339707 -498.975456 1601.437854 -500.323246 1599.581894 -500.888152 1602.649541 -500.593010 6.060915 1.786875 540.531893 56.047189 34.347666 5.663707 10.957755 6.198826 10.960521 6.194897 999.678690 -399.752729 1600.208824 -449.599614 1600.357732 -450.034364 1599.759049 -449.909799 1401.560902 -448.877187 1401.160227 -450.407128 26.024787 -500.279091 23.955345 -500.593614 23.985703 -500.083811 23.958945 -499.990309 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16855 2018-08-18 06:59:59 73.755150 123.381787 8.028927 6.50 1.304232 1648.421193 -400.382169 1648.742005 -400.359661 1648.578230 -399.363624 1648.833984 -399.669220 6.091855 4.617558 560.889077 85.718304 37.369774 5.335862 7.762770 9.158609 7.766744 9.156069 1199.245914 -300.845518 1149.807890 -498.789721 1047.963596 -498.413079 946.640977 -499.152477 849.664935 -499.214461 849.758091 -497.448664 35.043205 -499.045671 29.906659 -499.979939 26.002402 -499.953431 22.987238 -499.967351 23.031497 -501.167942 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428
16856 2018-08-18 07:59:59 69.049291 120.878188 7.962636 6.50 1.302419 1649.820162 -399.930973 1649.357538 -399.721222 1648.656192 -401.195834 1649.725133 -400.636306 6.121323 4.144989 559.031805 119.499241 38.591551 4.838619 7.356687 9.304952 7.095508 9.297924 1196.569267 -299.512227 1147.675196 -500.608341 1048.565741 -500.932810 949.773589 -500.023144 848.515225 -500.289405 850.013123 -496.822119 35.026062 -499.891945 29.921795 -499.949663 26.031747 -500.384612 22.991058 -500.079590 22.960095 -501.612783 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608
16857 2018-08-18 08:59:59 67.002189 105.666118 7.955111 6.50 1.315926 1649.166761 -399.888631 1649.196904 -399.677571 1647.896999 -399.988275 1649.772714 -399.831902 5.970515 4.020002 555.682872 122.262690 40.074026 4.525061 6.586020 9.299606 6.584130 9.300133 1204.866639 -299.235675 1149.942902 -501.717903 1049.604390 -500.549053 952.702732 -502.352296 849.016017 -500.505677 850.455635 -506.897968 35.003586 -501.083794 29.990533 -611.855898 25.948429 -500.067268 22.968268 -499.839442 23.015718 -501.711599 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452
16858 2018-08-18 09:59:59 65.523246 98.880538 7.984164 6.50 1.241969 1646.547763 -398.977083 1648.212240 -400.383265 1648.917387 -399.521344 1651.498591 -399.745329 6.048130 3.902537 544.731687 123.742430 39.713906 4.362781 6.210119 9.297709 6.209517 9.297194 1201.904177 -299.907308 1154.087804 -500.036580 1054.009756 -500.237335 944.138793 -496.866953 851.589767 -499.040466 851.345606 -499.122561 34.980742 -498.131002 29.968453 -586.013330 25.971737 -499.608392 22.958448 -499.821308 23.024963 -501.153409 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471
16859 2018-08-18 10:59:59 70.281454 95.248427 8.078957 6.50 1.283045 1648.759906 -399.862053 1650.135395 -399.957321 1648.831890 -400.586116 1649.464582 -400.673303 6.158718 3.875727 555.820208 94.544358 39.135119 4.365491 6.146982 9.308612 6.168939 9.309852 1196.238112 -299.862743 1147.248241 -500.363165 1047.279065 -500.354091 948.756608 -498.439416 849.441918 -499.255503 850.112246 -499.407112 34.940919 -500.150510 30.031867 -500.328335 26.033990 -500.147792 22.952306 -500.037678 23.018622 -500.492702 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575

14336 rows × 54 columns

In [95]:
# Define a function to replace df rows with the imputation
def replace_rows_median(df, df_imp, column):
    
    imputation = df_imp[column].median()
    values = df[column].fillna(imputation)
    index = values.index
    
    df.loc[index,[column]] = values
In [96]:
# replace primary_cleaner.input.sulfate with the median from the training set
replace_rows_median(gold_test_new2, gold_train_new2,'primary_cleaner.input.sulfate')
In [97]:
# Inspect the training set for remaining NaN values for the Test Set
gold_train_new2[test_imp_columns].hist(figsize = [12,8])
Out[97]:
array([[<AxesSubplot:title={'center':'primary_cleaner.input.sulfate'}>,
        <AxesSubplot:title={'center':'primary_cleaner.input.depressant'}>,
        <AxesSubplot:title={'center':'rougher.input.floatbank10_sulfate'}>],
       [<AxesSubplot:title={'center':'primary_cleaner.input.xanthate'}>,
        <AxesSubplot:title={'center':'rougher.input.floatbank10_xanthate'}>,
        <AxesSubplot:title={'center':'rougher.input.floatbank11_xanthate'}>],
       [<AxesSubplot:title={'center':'rougher.input.feed_sol'}>,
        <AxesSubplot:>, <AxesSubplot:>]], dtype=object)
No description has been provided for this image
In [98]:
# Input the median for primary_cleaner.input.depressant and primary_cleaner.input.xanthate

replace_rows_median(gold_test_new2,gold_train_new2,'primary_cleaner.input.depressant')
replace_rows_median(gold_test_new2,gold_train_new2,'primary_cleaner.input.xanthate')
In [99]:
# Look at rougher.input.floatbank10_xanthate and rougher.input.floatbank11_xanthate for the test set

xan_test = gold_test_new2[['rougher.input.floatbank10_xanthate','rougher.input.floatbank11_xanthate']]
xan_test.hist()

xan_test_values = xan_test[(xan_test['rougher.input.floatbank10_xanthate'].isna()) & (xan_test['rougher.input.floatbank11_xanthate'].notna())]
xan_test11_values = xan_test_values['rougher.input.floatbank11_xanthate']
xan_test11_index = xan_test11_values.index
gold_test_new2.loc[xan_test11_index,['rougher.input.floatbank10_xanthate']] = xan_test11_values
No description has been provided for this image
In [100]:
# Input the median from the training set in the remaining values in fb10 and fb11_xanthate missing NaN values
replace_rows_median(gold_test_new2,gold_train_new2,'rougher.input.floatbank10_xanthate')
replace_rows_median(gold_test_new2,gold_train_new2,'rougher.input.floatbank11_xanthate')

gold_test_new2[test_imp_columns].hist()
gold_test_new2.isna().sum()
Out[100]:
date                                            0
primary_cleaner.input.sulfate                   0
primary_cleaner.input.depressant                0
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                  0
primary_cleaner.state.floatbank8_a_air         16
primary_cleaner.state.floatbank8_a_level       16
primary_cleaner.state.floatbank8_b_air         16
primary_cleaner.state.floatbank8_b_level       16
primary_cleaner.state.floatbank8_c_air         16
primary_cleaner.state.floatbank8_c_level       16
primary_cleaner.state.floatbank8_d_air         16
primary_cleaner.state.floatbank8_d_level       16
rougher.input.feed_ag                          16
rougher.input.feed_pb                          16
rougher.input.feed_rate                        40
rougher.input.feed_size                        22
rougher.input.feed_sol                         67
rougher.input.feed_au                          16
rougher.input.floatbank10_sulfate             254
rougher.input.floatbank10_xanthate              0
rougher.input.floatbank11_sulfate              55
rougher.input.floatbank11_xanthate              0
rougher.state.floatbank10_a_air                17
rougher.state.floatbank10_a_level              16
rougher.state.floatbank10_b_air                17
rougher.state.floatbank10_b_level              16
rougher.state.floatbank10_c_air                17
rougher.state.floatbank10_c_level              16
rougher.state.floatbank10_d_air                17
rougher.state.floatbank10_d_level              16
rougher.state.floatbank10_e_air                17
rougher.state.floatbank10_e_level              16
rougher.state.floatbank10_f_air                17
rougher.state.floatbank10_f_level              16
secondary_cleaner.state.floatbank2_a_air       20
secondary_cleaner.state.floatbank2_a_level     16
secondary_cleaner.state.floatbank2_b_air       23
secondary_cleaner.state.floatbank2_b_level     16
secondary_cleaner.state.floatbank3_a_air       34
secondary_cleaner.state.floatbank3_a_level     16
secondary_cleaner.state.floatbank3_b_air       16
secondary_cleaner.state.floatbank3_b_level     16
secondary_cleaner.state.floatbank4_a_air       16
secondary_cleaner.state.floatbank4_a_level     16
secondary_cleaner.state.floatbank4_b_air       16
secondary_cleaner.state.floatbank4_b_level     16
secondary_cleaner.state.floatbank5_a_air       16
secondary_cleaner.state.floatbank5_a_level     16
secondary_cleaner.state.floatbank5_b_air       16
secondary_cleaner.state.floatbank5_b_level     16
secondary_cleaner.state.floatbank6_a_air       16
secondary_cleaner.state.floatbank6_a_level     16
dtype: int64
No description has been provided for this image
In [101]:
# Fix fb_10_sulfate in a similar way to the training set; use sulfate 11 (2 - 5.9) 
# For the trainig set we used sulfate 11 (5.9 - 13.1)
# However, the majority of the training sets NaN values were not outside of this range

print(f"Sulfate 11 between 2 - 5.9 & Differences Greater Than 16")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] > 16)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] > 16)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] > 16)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] > 16)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] > 16)]))
print()
print()
print("Histogram: Sulfate 11 Between 2 - 5.9; Difference > 16")
sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] > 16)].hist()
plt.show()

print()
print()
print()
print(f"Sulfate 11 between 2 - 5.9 & Difference Between 6 - 16")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 16) & 
    (sulfate_train['sulfate_difference'] > 6)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 16) 
    & (sulfate_train['sulfate_difference'] > 6)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 16) & 
    (sulfate_train['sulfate_difference'] > 6)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 16) 
    & (sulfate_train['sulfate_difference'] > 6)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 16) & 
    (sulfate_train['sulfate_difference'] > 6)]))
print()
print()
print("Histogram: Sulfate 11 Between 2 - 5.9; Difference Between 6 - 16")
sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 16) & 
    (sulfate_train['sulfate_difference'] > 6)].hist()
plt.show()


print()
print()
print()
print(f"Sulfate 11 between 2 - 5.9 & Differences Less Than 6")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 6)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 6)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 6)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 6)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 6)]))
print()
print()
print("Histogram: Sulfate 11 Between 2 - 5.9; Difference < 6")
sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 5.9) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 2) & (sulfate_train['sulfate_difference'] <= 6)].hist()
plt.show()
Sulfate 11 between 2 - 5.9 & Differences Greater Than 16
-----------------------------------
Median
rougher.input.floatbank10_sulfate    20.001514
rougher.input.floatbank11_sulfate     3.326154
sulfate_difference                   16.675361
dtype: float64
Mean
rougher.input.floatbank10_sulfate    20.001514
rougher.input.floatbank11_sulfate     3.326154
sulfate_difference                   16.675361
dtype: float64
Min
rougher.input.floatbank10_sulfate    20.001514
rougher.input.floatbank11_sulfate     3.326154
sulfate_difference                   16.675361
dtype: float64
Max
rougher.input.floatbank10_sulfate    20.001514
rougher.input.floatbank11_sulfate     3.326154
sulfate_difference                   16.675361
dtype: float64
Length
1

Histogram: Sulfate 11 Between 2 - 5.9; Difference > 16
No description has been provided for this image


Sulfate 11 between 2 - 5.9 & Difference Between 6 - 16
-----------------------------------
Median
rougher.input.floatbank10_sulfate    13.972227
rougher.input.floatbank11_sulfate     4.594282
sulfate_difference                    9.814021
dtype: float64
Mean
rougher.input.floatbank10_sulfate    13.524470
rougher.input.floatbank11_sulfate     4.273113
sulfate_difference                    9.251357
dtype: float64
Min
rougher.input.floatbank10_sulfate    11.004907
rougher.input.floatbank11_sulfate     2.070688
sulfate_difference                    6.503352
dtype: float64
Max
rougher.input.floatbank10_sulfate    15.148519
rougher.input.floatbank11_sulfate     5.833199
sulfate_difference                   10.874033
dtype: float64
Length
4

Histogram: Sulfate 11 Between 2 - 5.9; Difference Between 6 - 16
No description has been provided for this image


Sulfate 11 between 2 - 5.9 & Differences Less Than 6
-----------------------------------
Median
rougher.input.floatbank10_sulfate    4.865535
rougher.input.floatbank11_sulfate    4.837483
sulfate_difference                   0.000150
dtype: float64
Mean
rougher.input.floatbank10_sulfate    4.716024
rougher.input.floatbank11_sulfate    4.668578
sulfate_difference                   0.047447
dtype: float64
Min
rougher.input.floatbank10_sulfate    0.260698
rougher.input.floatbank11_sulfate    2.389089
sulfate_difference                  -5.575302
dtype: float64
Max
rougher.input.floatbank10_sulfate    8.511460
rougher.input.floatbank11_sulfate    5.893169
sulfate_difference                   5.410969
dtype: float64
Length
280

Histogram: Sulfate 11 Between 2 - 5.9; Difference < 6
No description has been provided for this image
In [102]:
# Fix fb_10_sulfate in a similar way to the training set; use sulfate 11 (0 - 2) 


print(f"Sulfate 11 between 0 - 2 & Differences Greater Than 19")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 19)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 19)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 19)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 19)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 19)]))
print()
print()
print("Histogram: Sulfate 11 Between 0 - 2; Difference > 19")
sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 19)].hist()
plt.show()

print()
print()
print()
print(f"Sulfate 11 between 0 - 2 & Difference Between 5 - 19")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 19) & 
    (sulfate_train['sulfate_difference'] > 5)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 19) 
    & (sulfate_train['sulfate_difference'] > 5)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 19) & 
    (sulfate_train['sulfate_difference'] > 5)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 19) 
    & (sulfate_train['sulfate_difference'] > 5)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 19) & 
    (sulfate_train['sulfate_difference'] > 5)]))
print()
print()
print("Histogram: Sulfate 11 Between 0 - 2; Difference Between 5 - 19")
sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 19) & 
    (sulfate_train['sulfate_difference'] > 5)].hist()
plt.show()


print()
print()
print()
print(f"Sulfate 11 between 0 - 2 & Differences Less Than 5")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 5)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 5)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 5)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 5)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 5)]))
print()
print()
print("Histogram: Sulfate 11 Between 0 - 2; Difference < 5")
sulfate_train[(sulfate_train['rougher.input.floatbank11_sulfate'] <= 2) & 
    (sulfate_train['rougher.input.floatbank11_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 5)].hist()
Sulfate 11 between 0 - 2 & Differences Greater Than 19
-----------------------------------
Median
rougher.input.floatbank10_sulfate    21.996584
rougher.input.floatbank11_sulfate     0.013980
sulfate_difference                   21.980544
dtype: float64
Mean
rougher.input.floatbank10_sulfate    21.215349
rougher.input.floatbank11_sulfate     0.013487
sulfate_difference                   21.201862
dtype: float64
Min
rougher.input.floatbank10_sulfate    19.998256
rougher.input.floatbank11_sulfate     0.001877
sulfate_difference                   19.971806
dtype: float64
Max
rougher.input.floatbank10_sulfate    23.748453
rougher.input.floatbank11_sulfate     0.028495
sulfate_difference                   23.746576
dtype: float64
Length
24

Histogram: Sulfate 11 Between 0 - 2; Difference > 19
No description has been provided for this image


Sulfate 11 between 0 - 2 & Difference Between 5 - 19
-----------------------------------
Median
rougher.input.floatbank10_sulfate    12.998672
rougher.input.floatbank11_sulfate     0.029660
sulfate_difference                   12.962889
dtype: float64
Mean
rougher.input.floatbank10_sulfate    12.538645
rougher.input.floatbank11_sulfate     0.039859
sulfate_difference                   12.498786
dtype: float64
Min
rougher.input.floatbank10_sulfate    5.997823
rougher.input.floatbank11_sulfate    0.000086
sulfate_difference                   5.808519
dtype: float64
Max
rougher.input.floatbank10_sulfate    18.003129
rougher.input.floatbank11_sulfate     1.582220
sulfate_difference                   17.999669
dtype: float64
Length
379

Histogram: Sulfate 11 Between 0 - 2; Difference Between 5 - 19
No description has been provided for this image


Sulfate 11 between 0 - 2 & Differences Less Than 5
-----------------------------------
Median
rougher.input.floatbank10_sulfate    0.354322
rougher.input.floatbank11_sulfate    0.159014
sulfate_difference                  -0.000815
dtype: float64
Mean
rougher.input.floatbank10_sulfate    0.565230
rougher.input.floatbank11_sulfate    0.379218
sulfate_difference                   0.186012
dtype: float64
Min
rougher.input.floatbank10_sulfate    0.001472
rougher.input.floatbank11_sulfate    0.000049
sulfate_difference                  -0.371064
dtype: float64
Max
rougher.input.floatbank10_sulfate    4.167661
rougher.input.floatbank11_sulfate    1.987143
sulfate_difference                   4.048312
dtype: float64
Length
25

Histogram: Sulfate 11 Between 0 - 2; Difference < 5
Out[102]:
array([[<AxesSubplot:title={'center':'rougher.input.floatbank10_sulfate'}>,
        <AxesSubplot:title={'center':'rougher.input.floatbank11_sulfate'}>],
       [<AxesSubplot:title={'center':'sulfate_difference'}>,
        <AxesSubplot:>]], dtype=object)
No description has been provided for this image
In [103]:
# Define a function to replace the specified values
def replace_add(df, column_1, column_2, start,stop, add):
    replace = df[(df[column_2] >= start) &
        (df[column_2] <= stop) &
        (df[column_1].isna())]
    
    replace_index = replace.index
    replace_values = replace[column_2] + add

    df.loc[replace_index,[column_1]] = replace_values
In [104]:
# When Sulfate 11 Between 2 - 5.9 & Difference < 6
# Sulfate 10 ~ +0.0 difference ~98.2% of the time (280/285)
# column_1 = 10_sulfate
replace_add(gold_test_new2, 'rougher.input.floatbank10_sulfate','rougher.input.floatbank11_sulfate', 2, 5.9, 0.0)
In [105]:
# When Sulfate 11 Between 0 - 2 & Difference Between 5 - 19
# Sulfate 10 ~ +12.96 difference ~88.6% of the time (379/428)
# column_1 = 10_sulfate

replace_add(gold_test_new2, 'rougher.input.floatbank10_sulfate','rougher.input.floatbank11_sulfate', 0, 2, 12.96)

Imputation Strategy for rougher.input.floatbank10_sulfate

Analysis of the relationship between rougher.input.floatbank10_sulfate and rougher.input.floatbank11_sulfate reveals two distinct behavioral patterns based on sulfate 11 concentration, enabling accurate conditional imputation for the test set.


Statistical Comparison

Pattern 1: Sulfate 11 Between 2.0 - 5.9 (Near-Perfect Correlation)

Metric floatbank10_sulfate floatbank11_sulfate Difference Key Finding
Median 4.87 4.84 0.00015 Nearly identical values
Mean 4.72 4.67 0.047 Minimal systematic bias
Min 0.26 2.39 -5.58 Occasional outliers
Max 8.51 5.89 5.41 Occasional outliers
  • Observations: 280/285 (98.2% for this range)

image.png

Pattern 2: Sulfate 11 Between 0.0 - 2.0 (Large Offset)

Metric floatbank10_sulfate floatbank11_sulfate Difference Key Finding
Median 12.99 0.030 12.96 Consistent +13 offset
Mean 12.54 0.040 12.50 Stable relationship
Min 6.00 0.000086 5.81 Lower bound maintained
Max 18.00 1.58 18.00 Upper bound maintained
  • Observations: 379/428 (88.6% for this range)

image.png


Key Findings

  • Conditional relationship identified: The relationship between floatbank 10 and 11 sulfate measurements changes dramatically based on sulfate 11 concentration
  • Pattern 1 (Mid-Range, 2.0-5.9): Median difference of 0.00015 confirms floatbank10 ≈ floatbank11 under normal synchronized conditions
  • Pattern 2 (Low Range, 0.0-2.0): Median difference of 12.96 reveals floatbank 10 maintains a consistent +13 unit offset, suggesting different stages
  • Strong empirical support: 659 total training observations (280 + 379) provide high confidence in pattern reliability
  • High coverage in low range: 88.6% of observations (379/428) in the 0-2 range follow the +12.96 offset pattern

Recommended imputation approach: Apply conditional imputation to the test set based on floatbank11_sulfate value:

  • When sulfate 11 [2.0 - 5.9]: floatbank10_sulfate = floatbank11_sulfate + 0.0
  • When sulfate 11 [0.0 - 2.0]: floatbank10_sulfate = floatbank11_sulfate + 12.96

The strong patterns observed in training data ensure this relationship-based imputation will generalize well to the test set and accurately represent underlying process dynamics.

In [106]:
# Look into fb_11_sulfate


# Fix fb_11_sulfate in a similar way to the training set; use sulfate 10 (0 - 11) 
# For the trainig set we used sulfate 11 (5.9 - 13.1)
# However, the majority of the training sets NaN values were not outside of this range


print(f"Sulfate 10 between 0 - 11 & Differences Greater Than 16")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 16)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 16)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 16)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 16)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 16)]))
print()
print()
print("Histogram: Sulfate 10 Between 0 - 11; Difference > 16")
sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] > 16)].hist()
plt.show()

print()
print()
print()
print(f"Sulfate 10 between 0 - 11 & Difference Between 6 - 16")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 16) & 
    (sulfate_train['sulfate_difference'] > 6)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 16) 
    & (sulfate_train['sulfate_difference'] > 6)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 16) & 
    (sulfate_train['sulfate_difference'] > 6)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 16) 
    & (sulfate_train['sulfate_difference'] > 6)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 16) & 
    (sulfate_train['sulfate_difference'] > 6)]))
print()
print()
print("Histogram: Sulfate 10 Between 0 - 11; Difference Between 6 - 16")
sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 16) & 
    (sulfate_train['sulfate_difference'] > 6)].hist()
plt.show()


print()
print()
print()
print(f"Sulfate 10 between 0 - 11 & Differences Less Than 6")
print("-----------------------------------")
print("Median")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 6)].median())
print()
print("Mean")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 6)].mean())
print()
print("Min")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 6)].min())
print()
print("Max")
display(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 6)].max())
print()
print("Length")
display(len(sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 6)]))
print()
print()
print("Histogram: Sulfate 10 Between 0 - 11; Difference < 6")
sulfate_train[(sulfate_train['rougher.input.floatbank10_sulfate'] <= 11) & 
    (sulfate_train['rougher.input.floatbank10_sulfate'] > 0) & (sulfate_train['sulfate_difference'] <= 6)].hist()
plt.show()
Sulfate 10 between 0 - 11 & Differences Greater Than 16
-----------------------------------
Median
rougher.input.floatbank10_sulfate   NaN
rougher.input.floatbank11_sulfate   NaN
sulfate_difference                  NaN
dtype: float64
Mean
rougher.input.floatbank10_sulfate   NaN
rougher.input.floatbank11_sulfate   NaN
sulfate_difference                  NaN
dtype: float64
Min
rougher.input.floatbank10_sulfate   NaN
rougher.input.floatbank11_sulfate   NaN
sulfate_difference                  NaN
dtype: float64
Max
rougher.input.floatbank10_sulfate   NaN
rougher.input.floatbank11_sulfate   NaN
sulfate_difference                  NaN
dtype: float64
Length
0

Histogram: Sulfate 10 Between 0 - 11; Difference > 16
No description has been provided for this image


Sulfate 10 between 0 - 11 & Difference Between 6 - 16
-----------------------------------
Median
rougher.input.floatbank10_sulfate    9.998971
rougher.input.floatbank11_sulfate    0.024849
sulfate_difference                   9.974456
dtype: float64
Mean
rougher.input.floatbank10_sulfate    9.458411
rougher.input.floatbank11_sulfate    0.045602
sulfate_difference                   9.412809
dtype: float64
Min
rougher.input.floatbank10_sulfate    6.805585
rougher.input.floatbank11_sulfate    0.000086
sulfate_difference                   6.781141
dtype: float64
Max
rougher.input.floatbank10_sulfate    10.999767
rougher.input.floatbank11_sulfate     1.222696
sulfate_difference                   10.974617
dtype: float64
Length
52

Histogram: Sulfate 10 Between 0 - 11; Difference Between 6 - 16
No description has been provided for this image


Sulfate 10 between 0 - 11 & Differences Less Than 6
-----------------------------------
Median
rougher.input.floatbank10_sulfate    9.427303
rougher.input.floatbank11_sulfate    9.429677
sulfate_difference                  -0.000142
dtype: float64
Mean
rougher.input.floatbank10_sulfate    8.971865
rougher.input.floatbank11_sulfate    8.985439
sulfate_difference                  -0.013574
dtype: float64
Min
rougher.input.floatbank10_sulfate     0.001164
rougher.input.floatbank11_sulfate     0.000049
sulfate_difference                  -12.977835
dtype: float64
Max
rougher.input.floatbank10_sulfate    10.999994
rougher.input.floatbank11_sulfate    14.501981
sulfate_difference                    5.990988
dtype: float64
Length
6076

Histogram: Sulfate 10 Between 0 - 11; Difference < 6
No description has been provided for this image
In [107]:
# When Sulfate 10 Between 0 - 11 & Difference < 6
# Sulfate 11 ~ -0.0 difference ~99.2% of the time (6076/6128)
# column_1 = 11_sulfate
replace_add(gold_test_new2, 'rougher.input.floatbank11_sulfate','rougher.input.floatbank10_sulfate', 0, 11, -0.0)

Imputation Strategy for rougher.input.floatbank11_sulfate: Test Set

Analysis of the relationship between rougher.input.floatbank11_sulfate and rougher.input.floatbank10_sulfate reveals a dominant pattern when sulfate 10 is in the low-to-mid range (0-11), enabling accurate imputation for the test set.

image.png


Statistical Comparison

Pattern 1: Sulfate 10 Between 0 - 11, Difference < 6 (Near-Perfect Correlation)

Metric floatbank11_sulfate floatbank10_sulfate Difference Key Finding
Median 9.43 9.43 -0.00014 Nearly identical values
Mean 8.99 8.97 -0.014 Minimal systematic bias
Min 0.000049 0.0012 -12.98 Occasional outliers
Max 14.50 11.00 5.99 Occasional outliers
Observations 6,076 6,076 - Excellent sample size

Observations: 6,076/6128 - 99.2% in this range

Pattern 2: Sulfate 10 Between 0 - 11, Difference Between 6 - 16 (Large Offset)

Metric floatbank11_sulfate floatbank10_sulfate Difference Key Finding
Median 0.025 10.00 9.97 Floatbank 11 near zero
Mean 0.046 9.46 9.41 Floatbank 11 near zero
Min 0.000086 6.81 6.78 Lower bound maintained
Max 1.22 11.00 10.97 Upper bound maintained

Observations: 52/6128


Key Findings

The relationship between floatbank 11 and 10 sulfate measurements is dominated by a single strong pattern: when sulfate 10 is between 0-11, the measurements are nearly identical (difference ≈ -0.00014). With 6,076 training observations representing 99.2% coverage (6,076/6,128) in this range, the near-perfect correlation provides extremely high confidence for imputation. The secondary pattern (52 observations with difference 6-16) represents less than 1% of cases and involves floatbank 11 values near zero, making it unsuitable for reliable imputation.


Recommended imputation approach: Apply simple imputation to the test set based on floatbank10_sulfate value when it falls between 0-11:

  • floatbank11_sulfate = floatbank10_sulfate - 0.0

This strategy leverages the overwhelmingly dominant pattern (99.2% coverage) where the two measurements are synchronized. The exceptional sample size and near-perfect correlation ensure this imputation will accurately represent the flotation process for virtually all missing values in the test set.

In [108]:
# rougher.input.feed_sol compared with rougher.input.feed_size
temp = gold_train_new2[['rougher.input.feed_size','rougher.input.feed_sol']].copy()
tempy = temp['rougher.input.feed_size'] - temp['rougher.input.feed_sol']
temp['size_sol_difference'] = tempy

temp.hist()
Out[108]:
array([[<AxesSubplot:title={'center':'rougher.input.feed_size'}>,
        <AxesSubplot:title={'center':'rougher.input.feed_sol'}>],
       [<AxesSubplot:title={'center':'size_sol_difference'}>,
        <AxesSubplot:>]], dtype=object)
No description has been provided for this image
In [109]:
print(f"feed_size between 24 - 30 & Differences Greater Than 12")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] > 12)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] > 12)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] > 12)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] > 12)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] > 12)]))
print()
print()
print("Histogram: feed_size Between 24 - 30; Difference > 12")
temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] > 12)].hist()
plt.show()
print()
print()
print()
print(f"feed_size between 24 - 30 & Difference Between -5 - 12")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= 12) & 
    (temp['size_sol_difference'] > -5)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= 12) 
    & (temp['size_sol_difference'] > -5)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= 12) & 
    (temp['size_sol_difference'] > -5)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= 12) 
    & (temp['size_sol_difference'] > -5)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= 12) & 
    (temp['size_sol_difference'] > -5)]))
print()
print()
print("Histogram: feedsize Between 24 - 30; Difference Between -5 - 12")
temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= 12) & 
    (temp['size_sol_difference'] > -5)].hist()
plt.show()
print()
print()
print()
print(f"feedsize between 24 - 30 & Differences Less Than -5")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= -5)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= -5)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= -5)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= -5)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= -5)]))
print()
print()
print("Histogram: feed_size Between 24 - 30; Difference < -5")
temp[(temp['rougher.input.feed_size'] <= 30) & 
    (temp['rougher.input.feed_size'] > 24) & (temp['size_sol_difference'] <= -5)].hist()
plt.show()
feed_size between 24 - 30 & Differences Greater Than 12
-----------------------------------
Median
rougher.input.feed_size   NaN
rougher.input.feed_sol    NaN
size_sol_difference       NaN
dtype: float64
Mean
rougher.input.feed_size   NaN
rougher.input.feed_sol    NaN
size_sol_difference       NaN
dtype: float64
Min
rougher.input.feed_size   NaN
rougher.input.feed_sol    NaN
size_sol_difference       NaN
dtype: float64
Max
rougher.input.feed_size   NaN
rougher.input.feed_sol    NaN
size_sol_difference       NaN
dtype: float64
Length
0

Histogram: feed_size Between 24 - 30; Difference > 12
No description has been provided for this image


feed_size between 24 - 30 & Difference Between -5 - 12
-----------------------------------
Median
rougher.input.feed_size    27.109922
rougher.input.feed_sol     25.216914
size_sol_difference         2.251916
dtype: float64
Mean
rougher.input.feed_size    27.041810
rougher.input.feed_sol     24.346161
size_sol_difference         2.695650
dtype: float64
Min
rougher.input.feed_size    24.791570
rougher.input.feed_sol     16.127246
size_sol_difference        -4.211725
dtype: float64
Max
rougher.input.feed_size    29.094951
rougher.input.feed_sol     31.378698
size_sol_difference        11.791231
dtype: float64
Length
10

Histogram: feedsize Between 24 - 30; Difference Between -5 - 12
No description has been provided for this image


feedsize between 24 - 30 & Differences Less Than -5
-----------------------------------
Median
rougher.input.feed_size    26.900620
rougher.input.feed_sol     39.273575
size_sol_difference       -11.222353
dtype: float64
Mean
rougher.input.feed_size    27.504092
rougher.input.feed_sol     38.441521
size_sol_difference       -10.937429
dtype: float64
Min
rougher.input.feed_size    25.523679
rougher.input.feed_sol     34.017529
size_sol_difference       -13.428541
dtype: float64
Max
rougher.input.feed_size    28.915604
rougher.input.feed_sol     40.432298
size_sol_difference        -6.391390
dtype: float64
Length
33

Histogram: feed_size Between 24 - 30; Difference < -5
No description has been provided for this image

Summary Notes: rougher.input.feed_size (24 - 30)

  • In the case, when rougher.input.feed_size is between 25.5 - 30 the rougher.input.feed_sol ranges between 34 - 41 w/ median 39.273575
  • This will be better than using the difference number as the range is spread from (-)14 - (-)6
  • While this is only a difference of 1, there is a cluster between 37 - 41 for rougher.input.feed_sol
  • However, there is a cluster from -8 - (-)14 for the difference of size - sol
  • Similarly, for the cases below 25.5 (see min range - rougher.input.feed_size: "feedsize between 24 - 30 & Differences Less Than -5"), it is better to use the median for sol (25.216914) as the range is smaller
In [110]:
# Input 25.2 in the test set for all values under 25.5 in feed_size
temp_new = gold_test_new2[['rougher.input.feed_size','rougher.input.feed_sol']]
temp_new_df = temp_new[(temp_new['rougher.input.feed_size'].notna()) & 
    (temp_new['rougher.input.feed_size'] < 25.5) & temp_new['rougher.input.feed_sol'].isna()]
temp_new_index = temp_new_df.index
temp_new_index
gold_test_new2.loc[temp_new_index,['rougher.input.feed_sol']] = 25.216914
In [111]:
# Input 39.273575 in the test set for values under 30 in feed_size when sol is NaN
temp_new39 = gold_test_new2[['rougher.input.feed_size','rougher.input.feed_sol']]
temp_new39_df = temp_new39[(temp_new39['rougher.input.feed_size'].notna()) & 
    (temp_new39['rougher.input.feed_size'] < 30) & temp_new39['rougher.input.feed_sol'].isna()]
temp_new39_index = temp_new39_df.index
gold_test_new2.loc[temp_new39_index,['rougher.input.feed_sol']] = 39.273575
In [112]:
print(f"feed_size between 30 - 35 & Differences Greater Than 30")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] > 30)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] > 30)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] > 30)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] > 30)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] > 30)]))
print()
print()
print("Histogram: feed_size Between 30 - 35; Difference > 30")
temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] > 30)].hist()
plt.show()

print()
print()
print()
print(f"feed_size between 30 - 35 & Difference Between 4 - 30")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] <= 30) & 
    (temp['size_sol_difference'] > 4)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] <= 30) 
    & (temp['size_sol_difference'] > 4)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] <= 30) & 
    (temp['size_sol_difference'] > 4)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] <= 30) 
    & (temp['size_sol_difference'] > 4)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] <= 30) & 
    (temp['size_sol_difference'] > 4)]))
print()
print()
print("Histogram: feedsize Between 30 - 35; Difference Between 4 - 30")
temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] <= 30) & 
    (temp['size_sol_difference'] > 4)].hist()
plt.show()
print()
print()
print()

print(f"feed_size between 30 - 35 & Differences Less Than 4")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] < 4)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] < 4)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] < 4)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] < 4)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] < 4)]))
print()
print()
print("Histogram: feed_size Between 30 - 35; Difference < 4")
temp[(temp['rougher.input.feed_size'] <= 35) & 
    (temp['rougher.input.feed_size'] > 30) & (temp['size_sol_difference'] < 4)].hist()
plt.show()
feed_size between 30 - 35 & Differences Greater Than 30
-----------------------------------
Median
rougher.input.feed_size    31.828881
rougher.input.feed_sol      0.295493
size_sol_difference        31.736166
dtype: float64
Mean
rougher.input.feed_size    32.595230
rougher.input.feed_sol      0.926313
size_sol_difference        31.668917
dtype: float64
Min
rougher.input.feed_size    31.747198
rougher.input.feed_sol      0.010000
size_sol_difference        31.451704
dtype: float64
Max
rougher.input.feed_size    34.209612
rougher.input.feed_sol      2.473446
size_sol_difference        31.818881
dtype: float64
Length
3

Histogram: feed_size Between 30 - 35; Difference > 30
No description has been provided for this image


feed_size between 30 - 35 & Difference Between 4 - 30
-----------------------------------
Median
rougher.input.feed_size    32.845521
rougher.input.feed_sol     22.693740
size_sol_difference         9.791736
dtype: float64
Mean
rougher.input.feed_size    32.826805
rougher.input.feed_sol     22.981714
size_sol_difference         9.845092
dtype: float64
Min
rougher.input.feed_size    30.488406
rougher.input.feed_sol     14.251787
size_sol_difference         4.148806
dtype: float64
Max
rougher.input.feed_size    34.803126
rougher.input.feed_sol     30.654320
size_sol_difference        16.747172
dtype: float64
Length
15

Histogram: feedsize Between 30 - 35; Difference Between 4 - 30
No description has been provided for this image


feed_size between 30 - 35 & Differences Less Than 4
-----------------------------------
Median
rougher.input.feed_size    31.766950
rougher.input.feed_sol     36.606883
size_sol_difference        -4.452920
dtype: float64
Mean
rougher.input.feed_size    31.908016
rougher.input.feed_sol     36.336713
size_sol_difference        -4.428697
dtype: float64
Min
rougher.input.feed_size    30.338484
rougher.input.feed_sol     30.203501
size_sol_difference       -12.369594
dtype: float64
Max
rougher.input.feed_size    34.911402
rougher.input.feed_sol     43.031938
size_sol_difference         3.040533
dtype: float64
Length
60

Histogram: feed_size Between 30 - 35; Difference < 4
No description has been provided for this image
In [113]:
# Same instance as before, we will use the median for rougher.input.feed_sol - 36.606883
temp_new36 = gold_test_new2[['rougher.input.feed_size','rougher.input.feed_sol']]
temp_new36_df = temp_new36[(temp_new36['rougher.input.feed_size'].notna()) & 
    (temp_new36['rougher.input.feed_size'] < 35) & temp_new36['rougher.input.feed_sol'].isna()]
temp_new36_index = temp_new36_df.index
gold_test_new2.loc[temp_new36_index,['rougher.input.feed_sol']] = 36.606883
In [114]:
print(f"feed_size between 35 - 38 & Differences Less Than 32")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 38) & 
    (temp['rougher.input.feed_size'] > 35) & (temp['size_sol_difference'] < 32)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 38) & 
    (temp['rougher.input.feed_size'] > 35) & (temp['size_sol_difference'] < 32)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 38) & 
    (temp['rougher.input.feed_size'] > 35) & (temp['size_sol_difference'] < 32)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 38) & 
    (temp['rougher.input.feed_size'] > 35) & (temp['size_sol_difference'] < 32)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 38) & 
    (temp['rougher.input.feed_size'] > 35) & (temp['size_sol_difference'] < 32)]))
print()
print()
print("Histogram: feed_size Between 35 - 38; Difference < 32")
temp[(temp['rougher.input.feed_size'] <= 38) & (temp['rougher.input.feed_size'] > 35) & (temp['size_sol_difference'] < 32)].hist()
plt.show()
feed_size between 35 - 38 & Differences Less Than 32
-----------------------------------
Median
rougher.input.feed_size    37.057387
rougher.input.feed_sol     30.245821
size_sol_difference         6.905680
dtype: float64
Mean
rougher.input.feed_size    36.881165
rougher.input.feed_sol     27.823380
size_sol_difference         9.057785
dtype: float64
Min
rougher.input.feed_size    35.001415
rougher.input.feed_sol      5.495734
size_sol_difference        -3.183538
dtype: float64
Max
rougher.input.feed_size    37.997160
rougher.input.feed_sol     40.192872
size_sol_difference        31.367351
dtype: float64
Length
97

Histogram: feed_size Between 35 - 38; Difference < 32
No description has been provided for this image
In [115]:
# Use the difference of 6.9 as the range is slightly closer than the sol median
replace_add(gold_test_new2,'rougher.input.feed_sol','rougher.input.feed_size',35,38,6.9)
In [116]:
print(f"feedsize between 40 - 50 & Differences Less Than 50")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 50) & 
    (temp['rougher.input.feed_size'] > 40) & (temp['size_sol_difference'] <= 50)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 50) & 
    (temp['rougher.input.feed_size'] > 40) & (temp['size_sol_difference'] <= 50)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 50) & 
    (temp['rougher.input.feed_size'] > 40) & (temp['size_sol_difference'] <= 50)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 50) & 
    (temp['rougher.input.feed_size'] > 40) & (temp['size_sol_difference'] <= 50)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 50) & 
    (temp['rougher.input.feed_size'] > 40) & (temp['size_sol_difference'] <= 50)]))
print()
print()
print("Histogram: feed_size Between 40 - 50; Difference < 50")
temp[(temp['rougher.input.feed_size'] <= 50) & 
    (temp['rougher.input.feed_size'] > 40) & (temp['size_sol_difference'] <= 50)].hist()
plt.show()
feedsize between 40 - 50 & Differences Less Than 50
-----------------------------------
Median
rougher.input.feed_size    46.701653
rougher.input.feed_sol     34.544858
size_sol_difference        11.808177
dtype: float64
Mean
rougher.input.feed_size    46.346161
rougher.input.feed_sol     33.841244
size_sol_difference        12.504918
dtype: float64
Min
rougher.input.feed_size    40.000617
rougher.input.feed_sol      1.169889
size_sol_difference        -0.736386
dtype: float64
Max
rougher.input.feed_size    49.996193
rougher.input.feed_sol     45.955078
size_sol_difference        47.611883
dtype: float64
Length
3866

Histogram: feed_size Between 40 - 50; Difference < 50
No description has been provided for this image
In [117]:
# A significant portion of this range (feed_size: 40 - 50) from the training set has a difference of ~11.8 from feed_sol
replace_add(gold_test_new2,'rougher.input.feed_sol','rougher.input.feed_size',40,50,11.8)
In [118]:
print(f"feed_size between 50 - 60 & Differences Less Than 55")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 60) & 
    (temp['rougher.input.feed_size'] > 50) & (temp['size_sol_difference'] < 55)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 60) & 
    (temp['rougher.input.feed_size'] > 50) & (temp['size_sol_difference'] < 55)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 60) & 
    (temp['rougher.input.feed_size'] > 50) & (temp['size_sol_difference'] < 55)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 60) & 
    (temp['rougher.input.feed_size'] > 50) & (temp['size_sol_difference'] < 55)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 60) & 
    (temp['rougher.input.feed_size'] > 50) & (temp['size_sol_difference'] < 55)]))
print()
print()
print("Histogram: feed_size Between 50 - 60; Difference < 55")
temp[(temp['rougher.input.feed_size'] <= 60) & 
    (temp['rougher.input.feed_size'] > 50) & (temp['size_sol_difference'] < 55)].hist()
plt.show()
feed_size between 50 - 60 & Differences Less Than 55
-----------------------------------
Median
rougher.input.feed_size    54.413676
rougher.input.feed_sol     36.802788
size_sol_difference        17.617468
dtype: float64
Mean
rougher.input.feed_size    54.547582
rougher.input.feed_sol     36.255374
size_sol_difference        18.292209
dtype: float64
Min
rougher.input.feed_size    50.004500
rougher.input.feed_sol      2.703490
size_sol_difference         6.438439
dtype: float64
Max
rougher.input.feed_size    59.998321
rougher.input.feed_sol     47.307449
size_sol_difference        53.629025
dtype: float64
Length
4927

Histogram: feed_size Between 50 - 60; Difference < 55
No description has been provided for this image
In [119]:
# A large number in this range (feed_size: 50 - 60) have a median of 36.802788
# Input the median for this range
temp_new368 = gold_test_new2[['rougher.input.feed_size','rougher.input.feed_sol']]
temp_new368_df = temp_new368[(temp_new368['rougher.input.feed_size'].notna()) & 
    (temp_new368['rougher.input.feed_size'] < 60) & temp_new368['rougher.input.feed_sol'].isna()]
temp_new368_index = temp_new368_df.index
gold_test_new2.loc[temp_new368_index,['rougher.input.feed_sol']] = 36.802788
In [120]:
print(f"feedsize between 60 - 70 & Differences Less Than 70")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 70) & 
    (temp['rougher.input.feed_size'] > 60) & (temp['size_sol_difference'] <= 70)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 70) & 
    (temp['rougher.input.feed_size'] > 60) & (temp['size_sol_difference'] <= 70)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 70) & 
    (temp['rougher.input.feed_size'] > 60) & (temp['size_sol_difference'] <= 70)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 70) & 
    (temp['rougher.input.feed_size'] > 60) & (temp['size_sol_difference'] <= 70)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 70) & 
    (temp['rougher.input.feed_size'] > 60) & (temp['size_sol_difference'] <= 70)]))
print()
print("Histogram: feed_size Between 60 - 70; Difference < 70")
temp[(temp['rougher.input.feed_size'] <= 70) & 
    (temp['rougher.input.feed_size'] > 60) & (temp['size_sol_difference'] <= 70)].hist()
plt.show()
feedsize between 60 - 70 & Differences Less Than 70
-----------------------------------
Median
rougher.input.feed_size    64.135303
rougher.input.feed_sol     38.944012
size_sol_difference        25.295020
dtype: float64
Mean
rougher.input.feed_size    64.443953
rougher.input.feed_sol     38.459652
size_sol_difference        25.984302
dtype: float64
Min
rougher.input.feed_size    60.001225
rougher.input.feed_sol      0.010000
size_sol_difference        15.127535
dtype: float64
Max
rougher.input.feed_size    69.998638
rougher.input.feed_sol     48.363177
size_sol_difference        69.240783
dtype: float64
Length
2259
Histogram: feed_size Between 60 - 70; Difference < 70
No description has been provided for this image
In [121]:
# A large number in this range (feed_size: 60 - 70) have a median of 38.944012
# Input the median for this range
temp_new389 = gold_test_new2[['rougher.input.feed_size','rougher.input.feed_sol']]
temp_new389_df = temp_new389[(temp_new389['rougher.input.feed_size'].notna()) & 
    (temp_new389['rougher.input.feed_size'] < 70) & temp_new389['rougher.input.feed_sol'].isna()]
temp_new389_index = temp_new389_df.index
gold_test_new2.loc[temp_new389_index,['rougher.input.feed_sol']] = 38.944012
In [122]:
print(f"feed_size between 70 - 75 & Differences Greater Than 40")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] > 41)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] > 41)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] > 41)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] > 41)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] > 41)]))
print()
print()
print("Histogram: feed_size Between 70 - 75; Difference > 41")
temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] > 41)].hist()
plt.show()

print()
print()
print()

print(f"feed_size between 70 - 75 & Difference Between 28 - 41")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 41) & 
    (temp['size_sol_difference'] > 28)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 41) 
    & (temp['size_sol_difference'] > 28)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 41) & 
    (temp['size_sol_difference'] > 28)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 41) 
    & (temp['size_sol_difference'] > 28)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 41) & 
    (temp['size_sol_difference'] > 28)]))
print()
print()
print("Histogram: feedsize Between 70 - 75; Difference Between 28 - 41")
temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 41) & 
    (temp['size_sol_difference'] > 28)].hist()
plt.show()
print()
print()
print()


print(f"feedsize between 70 - 75 & Differences Less Than 28")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 28)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 28)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 28)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 28)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 28)]))
print()
print("Histogram: feed_size Between 70 - 75; Difference < 28")
temp[(temp['rougher.input.feed_size'] <= 75) & 
    (temp['rougher.input.feed_size'] > 70) & (temp['size_sol_difference'] <= 28)].hist()
plt.show()
feed_size between 70 - 75 & Differences Greater Than 40
-----------------------------------
Median
rougher.input.feed_size    73.703218
rougher.input.feed_sol     27.992614
size_sol_difference        43.988461
dtype: float64
Mean
rougher.input.feed_size    73.073895
rougher.input.feed_sol     27.353125
size_sol_difference        45.720770
dtype: float64
Min
rougher.input.feed_size    70.327801
rougher.input.feed_sol     16.194259
size_sol_difference        41.167205
dtype: float64
Max
rougher.input.feed_size    74.705631
rougher.input.feed_sol     33.315613
size_sol_difference        56.588363
dtype: float64
Length
31

Histogram: feed_size Between 70 - 75; Difference > 41
No description has been provided for this image


feed_size between 70 - 75 & Difference Between 28 - 41
-----------------------------------
Median
rougher.input.feed_size    72.809390
rougher.input.feed_sol     39.410044
size_sol_difference        33.272169
dtype: float64
Mean
rougher.input.feed_size    72.597292
rougher.input.feed_sol     39.118198
size_sol_difference        33.479093
dtype: float64
Min
rougher.input.feed_size    70.000791
rougher.input.feed_sol     30.481283
size_sol_difference        28.045006
dtype: float64
Max
rougher.input.feed_size    74.997287
rougher.input.feed_sol     45.500334
size_sol_difference        40.952899
dtype: float64
Length
777

Histogram: feedsize Between 70 - 75; Difference Between 28 - 41
No description has been provided for this image


feedsize between 70 - 75 & Differences Less Than 28
-----------------------------------
Median
rougher.input.feed_size    71.141761
rougher.input.feed_sol     44.091765
size_sol_difference        27.536873
dtype: float64
Mean
rougher.input.feed_size    71.351773
rougher.input.feed_sol     44.235818
size_sol_difference        27.115954
dtype: float64
Min
rougher.input.feed_size    70.025827
rougher.input.feed_sol     42.187260
size_sol_difference        24.766180
dtype: float64
Max
rougher.input.feed_size    74.585721
rougher.input.feed_sol     47.226941
size_sol_difference        27.970435
dtype: float64
Length
34
Histogram: feed_size Between 70 - 75; Difference < 28
No description has been provided for this image
In [123]:
# A large portion of this range (feed_size: 70 - 75) -  rougher.input.feed_sol median - 39.410044

temp_new394 = gold_test_new2[['rougher.input.feed_size','rougher.input.feed_sol']]
temp_new394_df = temp_new394[(temp_new394['rougher.input.feed_size'].notna()) & 
    (temp_new394['rougher.input.feed_size'] < 75) & temp_new394['rougher.input.feed_sol'].isna()]
temp_new394_index = temp_new394_df.index
gold_test_new2.loc[temp_new394_index,['rougher.input.feed_sol']] = 39.410044
In [124]:
print(f"feedsize between 80 - 85 & Differences Greater Than 47")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] >= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] > 47)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] > 47)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] > 47)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] > 47)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] > 47)]))
print()
print("Histogram: feed_size Between 80 - 85; Difference > 47")
temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] > 47)].hist()
plt.show()
print()
print()
print()

print(f"feed_size between 80 - 85 & Difference Between 37 - 47")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 47) & 
    (temp['size_sol_difference'] > 37)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 47) 
    & (temp['size_sol_difference'] > 37)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 47) & 
    (temp['size_sol_difference'] > 37)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 47) 
    & (temp['size_sol_difference'] > 37)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 47) & 
    (temp['size_sol_difference'] > 37)]))
print()
print()
print("Histogram: feedsize Between 80 - 85; Difference Between 37 - 47")
temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 47) & 
    (temp['size_sol_difference'] > 37)].hist()
plt.show()
print()
print()
print()

print(f"feedsize between 80 - 85 & Differences Less Than 37")
print("-----------------------------------")
print("Median")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 37)].median())
print()
print("Mean")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 37)].mean())
print()
print("Min")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 37)].min())
print()
print("Max")
display(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 37)].max())
print()
print("Length")
display(len(temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 37)]))
print()
print("Histogram: feed_size Between 80 - 85; Difference < 37")
temp[(temp['rougher.input.feed_size'] <= 85) & 
    (temp['rougher.input.feed_size'] > 80) & (temp['size_sol_difference'] <= 37)].hist()
plt.show()
feedsize between 80 - 85 & Differences Greater Than 47
-----------------------------------
Median
rougher.input.feed_size    100.347223
rougher.input.feed_sol      35.775333
size_sol_difference         67.731138
dtype: float64
Mean
rougher.input.feed_size    82.682123
rougher.input.feed_sol     26.101812
size_sol_difference        56.580310
dtype: float64
Min
rougher.input.feed_size    80.058526
rougher.input.feed_sol      0.010000
size_sol_difference        47.074358
dtype: float64
Max
rougher.input.feed_size    84.864875
rougher.input.feed_sol     37.683318
size_sol_difference        81.133699
dtype: float64
Length
56
Histogram: feed_size Between 80 - 85; Difference > 47
No description has been provided for this image


feed_size between 80 - 85 & Difference Between 37 - 47
-----------------------------------
Median
rougher.input.feed_size    82.580666
rougher.input.feed_sol     40.657336
size_sol_difference        42.245351
dtype: float64
Mean
rougher.input.feed_size    82.487730
rougher.input.feed_sol     40.413408
size_sol_difference        42.074321
dtype: float64
Min
rougher.input.feed_size    80.015783
rougher.input.feed_sol     33.561037
size_sol_difference        37.013060
dtype: float64
Max
rougher.input.feed_size    84.990078
rougher.input.feed_sol     46.260368
size_sol_difference        46.978386
dtype: float64
Length
479

Histogram: feedsize Between 80 - 85; Difference Between 37 - 47
No description has been provided for this image


feedsize between 80 - 85 & Differences Less Than 37
-----------------------------------
Median
rougher.input.feed_size    80.642546
rougher.input.feed_sol     44.916304
size_sol_difference        36.075423
dtype: float64
Mean
rougher.input.feed_size    80.898649
rougher.input.feed_sol     45.085368
size_sol_difference        35.813280
dtype: float64
Min
rougher.input.feed_size    80.000144
rougher.input.feed_sol     43.227229
size_sol_difference        32.973632
dtype: float64
Max
rougher.input.feed_size    83.283444
rougher.input.feed_sol     47.376957
size_sol_difference        36.984625
dtype: float64
Length
22
Histogram: feed_size Between 80 - 85; Difference < 37
No description has been provided for this image
In [125]:
# A large portion of this range (feed_size: 80 - 85) -  rougher.input.feed_sol median - 40.657336

temp_new40 = gold_test_new2[['rougher.input.feed_size','rougher.input.feed_sol']]
temp_new40_df = temp_new40[(temp_new40['rougher.input.feed_size'].notna()) & 
    (temp_new40['rougher.input.feed_size'] < 83) & temp_new40['rougher.input.feed_sol'].isna()]
temp_new40_index = temp_new40_df.index
gold_test_new2.loc[temp_new40_index,['rougher.input.feed_sol']] = 40.657336

Imputation Strategy for rougher.input.feed_sol

Analysis of the relationship between rougher.input.feed_sol and rougher.input.feed_size reveals that the optimal imputation strategy varies across different feed_size ranges. Unlike the sulfate measurements which showed consistent conditional patterns, the feed_sol relationship requires a range-specific approach using either median values or calculated offsets depending on data distribution characteristics.

image.pngimage.pngimage.pngimage.pngimage.pngimage.pngimage.pngimage.png


Statistical Summary by Feed Size Range

Feed Size Range Observations Feed Sol Median Difference Median Imputation Method Rationale
24 - 30 (diff -5 to 12) 10 25.22 2.25 Median (25.22) for feed_size < 25.5 Small sample; lower sol values
24 - 30 (diff < -5) 33 39.27 -11.22 Median (39.27) for feed_size ≥ 25.5 Better sample; sol cluster 34-40
30 - 35 60 36.61 -4.45 Median (36.61) Moderate sample; sol range 30-43
35 - 38 97 30.25 6.91 Offset (+6.9) Difference tighter than sol range
40 - 50 3,866 34.54 11.81 Offset (+11.8) Large sample; consistent difference
50 - 60 4,927 36.80 17.62 Median (36.80) Largest sample; stable median
60 - 70 2,259 38.94 25.30 Median (38.94) Large sample; stable median
70 - 75 777 39.41 33.27 Median (39.41) Moderate sample
80 - 85 479 40.66 42.25 Median (40.66) Moderate sample

Key Findings

The feed_sol and feed_size relationship is more complex than the sulfate measurements, with imputation strategies chosen based on: (1) sample size reliability, (2) whether the sol median or difference showed tighter clustering, and (3) the presence of consistent patterns. Two primary ranges dominate the training data: feed_size 40-50 (3,866 observations) and 50-60 (4,927 observations), representing the most reliable imputation zones. For smaller ranges (24-38), median sol values were preferred due to limited sample sizes and wider difference spreads. For the 35-38 and 40-50 ranges specifically, offset-based imputation was used because the difference showed more consistent patterns than the absolute sol values.


Limitations and Considerations

Unlike the sulfate imputation strategies which had strong conditional relationships (99%+ coverage with near-perfect correlation), the feed_sol imputation is less robust due to:

  • Weaker correlations: Higher variance within ranges, particularly in smaller feed_size ranges
  • Mixed methodology: Combining median and offset strategies introduces inconsistency
  • Limited observations in extremes: Only 10-97 observations in the 24-38 range reduces confidence
  • Wide difference spreads: Some ranges show difference variations of 20+ units, indicating higher uncertainty

This approach represents a pragmatic solution given the data characteristics, prioritizing the use of stable median values from large samples (4,927 and 3,866 observations in the 50-60 and 40-50 ranges) while accepting lower precision in edge cases. The imputation will be adequate for modeling purposes but carries more uncertainty than the sulfate strategies.

In [126]:
# Defina a function to input a spcecified columns median from the train dataset into the test dataset
def median_test_fillna(df_test,df_train, column):
    
    series = df_test[column]
    isna = series[series.isna()]
    series_index = isna.index
    train_median = df_train[column].median()
    
    df_test.loc[series_index,[column]] = train_median
In [127]:
# fill the rest of the NaN values for the test set


median_test_fillna(gold_test_new2, gold_train_new2,'primary_cleaner.state.floatbank8_a_air')
median_test_fillna(gold_test_new2, gold_train_new2,'primary_cleaner.state.floatbank8_a_level')
median_test_fillna(gold_test_new2, gold_train_new2,'primary_cleaner.state.floatbank8_b_air')
median_test_fillna(gold_test_new2, gold_train_new2,'primary_cleaner.state.floatbank8_b_level')
median_test_fillna(gold_test_new2, gold_train_new2,'primary_cleaner.state.floatbank8_c_air')
median_test_fillna(gold_test_new2, gold_train_new2,'primary_cleaner.state.floatbank8_c_level')
median_test_fillna(gold_test_new2, gold_train_new2,'primary_cleaner.state.floatbank8_d_air')
median_test_fillna(gold_test_new2, gold_train_new2,'primary_cleaner.state.floatbank8_d_level')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.input.feed_ag')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.input.feed_pb')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.input.feed_rate')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.input.feed_size')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.input.feed_sol')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.input.feed_au')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.input.floatbank10_sulfate')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.input.floatbank11_sulfate')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_a_air')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_a_level')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_b_air')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_b_level')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_c_air')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_c_level')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_d_air')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_d_level')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_e_air')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_e_level')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_f_air')
median_test_fillna(gold_test_new2, gold_train_new2,'rougher.state.floatbank10_f_level')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank2_a_air')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank2_a_level')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank2_b_air')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank2_b_level')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank3_a_air')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank3_a_level')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank3_b_air')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank3_b_level')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank4_a_air')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank4_a_level')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank4_b_air')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank4_b_level')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank5_a_air')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank5_a_level')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank5_b_air')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank5_b_level')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank6_a_air')
median_test_fillna(gold_test_new2, gold_train_new2,'secondary_cleaner.state.floatbank6_a_level')
In [128]:
gold_test_new2.isna().sum()
Out[128]:
date                                          0
primary_cleaner.input.sulfate                 0
primary_cleaner.input.depressant              0
primary_cleaner.input.feed_size               0
primary_cleaner.input.xanthate                0
primary_cleaner.state.floatbank8_a_air        0
primary_cleaner.state.floatbank8_a_level      0
primary_cleaner.state.floatbank8_b_air        0
primary_cleaner.state.floatbank8_b_level      0
primary_cleaner.state.floatbank8_c_air        0
primary_cleaner.state.floatbank8_c_level      0
primary_cleaner.state.floatbank8_d_air        0
primary_cleaner.state.floatbank8_d_level      0
rougher.input.feed_ag                         0
rougher.input.feed_pb                         0
rougher.input.feed_rate                       0
rougher.input.feed_size                       0
rougher.input.feed_sol                        0
rougher.input.feed_au                         0
rougher.input.floatbank10_sulfate             0
rougher.input.floatbank10_xanthate            0
rougher.input.floatbank11_sulfate             0
rougher.input.floatbank11_xanthate            0
rougher.state.floatbank10_a_air               0
rougher.state.floatbank10_a_level             0
rougher.state.floatbank10_b_air               0
rougher.state.floatbank10_b_level             0
rougher.state.floatbank10_c_air               0
rougher.state.floatbank10_c_level             0
rougher.state.floatbank10_d_air               0
rougher.state.floatbank10_d_level             0
rougher.state.floatbank10_e_air               0
rougher.state.floatbank10_e_level             0
rougher.state.floatbank10_f_air               0
rougher.state.floatbank10_f_level             0
secondary_cleaner.state.floatbank2_a_air      0
secondary_cleaner.state.floatbank2_a_level    0
secondary_cleaner.state.floatbank2_b_air      0
secondary_cleaner.state.floatbank2_b_level    0
secondary_cleaner.state.floatbank3_a_air      0
secondary_cleaner.state.floatbank3_a_level    0
secondary_cleaner.state.floatbank3_b_air      0
secondary_cleaner.state.floatbank3_b_level    0
secondary_cleaner.state.floatbank4_a_air      0
secondary_cleaner.state.floatbank4_a_level    0
secondary_cleaner.state.floatbank4_b_air      0
secondary_cleaner.state.floatbank4_b_level    0
secondary_cleaner.state.floatbank5_a_air      0
secondary_cleaner.state.floatbank5_a_level    0
secondary_cleaner.state.floatbank5_b_air      0
secondary_cleaner.state.floatbank5_b_level    0
secondary_cleaner.state.floatbank6_a_air      0
secondary_cleaner.state.floatbank6_a_level    0
dtype: int64

Imputation Strategy for Remaining Columns (<1% Missing Data: Test Set)

For all remaining columns in the test set with less than 1% missing data, a simple median imputation strategy was applied using column-specific medians calculated from the training set.


Methodology

Imputation approach:

  • Calculate the median value for each column from the training set (after row-dropping preprocessing)
  • Apply these training set medians to fill missing values in the corresponding test set columns
  • No conditional logic or range-based strategies required

Rationale:

  • Minimal impact: With <1% missing data per column, the imputation method has negligible effect on model performance
  • Computational efficiency: Simple median imputation is fast and straightforward
  • Adequate accuracy: For such small percentages of missing data, sophisticated methods provide minimal benefit over median imputation
  • Consistency with training: Using training set medians (rather than test set medians) prevents data leakage and maintains proper train-test separation

Key Considerations

For columns with <1% missing data, the additional complexity of conditional imputation is unnecessary. The simple median approach provides a clean, efficient solution that maintains data integrity while having minimal impact on the final model predictions.

In [129]:
# Lastly, change the date to datetime
gold_train_new2['date'] = pd.to_datetime(gold_train_new2['date'])
gold_test_new2['date'] = pd.to_datetime(gold_test_new2['date'])
gold_full_new2['date'] = pd.to_datetime(gold_full_new2['date'])

Analyze the Data¶

Concentrations of Metals Accross Purification Stages (Au, Ag, Pb): Original Dataset¶

In [130]:
# Take note of how the concentrations of metals (Au, Ag, Pb) change depending on the purification stage.

# Using original full dataset 

# Gold(Au)
au = gold_full.filter(like = 'au', axis = 1)
au_stages = ['rougher.input.feed_au',
    'rougher.output.concentrate_au',
    'rougher.output.tail_au',
    'primary_cleaner.output.concentrate_au',
    'primary_cleaner.output.tail_au',
    'secondary_cleaner.output.tail_au',
    'final.output.concentrate_au',
    'final.output.tail_au']
au_calculations = ['rougher.calculation.sulfate_to_au_concentrate',
       'rougher.calculation.floatbank10_sulfate_to_au_feed',
       'rougher.calculation.floatbank11_sulfate_to_au_feed',
       'rougher.calculation.au_pb_ratio']

# Silver(Ag)
ag = gold_full.filter(like = 'ag', axis = 1)
ag_stages = ['rougher.input.feed_ag',
             'rougher.output.concentrate_ag',
             'rougher.output.tail_ag',
             'primary_cleaner.output.concentrate_ag',
             'primary_cleaner.output.tail_ag',
             'secondary_cleaner.output.tail_ag',
             'final.output.concentrate_ag',
             'final.output.tail_ag']

# Lead(Pb)
pb = gold_full.filter(like = 'pb', axis = 1)
pb_stages = ['rougher.input.feed_pb', 
             'rougher.output.concentrate_pb',
             'rougher.output.tail_pb',
             'primary_cleaner.output.concentrate_pb',
             'primary_cleaner.output.tail_pb', 
             'secondary_cleaner.output.tail_pb',
             'final.output.concentrate_pb',
             'final.output.tail_pb']



# All Metals (Au, Ag, Pb)

all_metal_stages = au_stages + ag_stages + pb_stages

Gold(Au): Original Dataset¶

In [131]:
# Box plot - stages
au[au_stages].plot(kind='box', vert = False,figsize=(10, 5), title='Gold (Au) Concentration by Stage', grid = True)


# Line plot - Median Stages
au_median = au[au_stages].median()
stages = ['Rougher Feed','Rougher Conc','Rougher Tail',
          'Primary Cleaner Conc','Primary Cleaner Tail','Secondary Cleaner Tail','Final Conc','Final Tail']

plt.figure(figsize=(10,5))
plt.plot(stages, au_median, marker='o', label='Au median')
plt.title("Gold (Au) Median Concentration by Stage")
plt.ylabel("Concentration")
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

# Show stats
au[au_stages].describe()
No description has been provided for this image
No description has been provided for this image
Out[131]:
rougher.input.feed_au rougher.output.concentrate_au rougher.output.tail_au primary_cleaner.output.concentrate_au primary_cleaner.output.tail_au secondary_cleaner.output.tail_au final.output.concentrate_au final.output.tail_au
count 22617.000000 22618.000000 19980.000000 22618.000000 22617.000000 22618.000000 22630.000000 22635.000000
mean 7.565838 17.879538 1.821193 29.212289 3.670333 4.041218 40.001172 2.827459
std 3.026954 6.790112 0.695663 10.539303 1.985206 2.605738 13.398062 1.262834
min 0.000000 0.000000 0.020676 0.000000 0.000000 0.000000 0.000000 0.000000
25% 6.485009 17.928729 1.403951 29.374406 2.741534 2.877554 42.383721 2.303108
50% 7.884832 20.003202 1.808567 32.359813 3.513008 3.956171 44.653436 2.913794
75% 9.668064 21.564238 2.215317 34.770726 4.559485 5.006944 46.111999 3.555077
max 14.093363 28.824507 9.688980 45.933934 18.528821 26.811643 53.611374 9.789625

Silver(Ag): Original Dataset¶

In [132]:
# Box plot - stages
ag[ag_stages].plot(kind = 'box', vert = False, figsize = (10,5),title = "Silver (Ag) Concentration by Stage", grid = True)

# Line plot - Median Stages
ag_median = ag[ag_stages].median()

plt.figure(figsize = (10,5))
plt.plot(stages,ag_median, marker='o', label = 'Ag median')
plt.title("Silver (Ag) Median Concentration by Stage")
plt.ylabel("Concentration")
plt.xticks(rotation = 45)
plt.grid(True)
plt.show()


# Show stats
ag[ag_stages].describe()
No description has been provided for this image
No description has been provided for this image
Out[132]:
rougher.input.feed_ag rougher.output.concentrate_ag rougher.output.tail_ag primary_cleaner.output.concentrate_ag primary_cleaner.output.tail_ag secondary_cleaner.output.tail_ag final.output.concentrate_ag final.output.tail_ag
count 22618.000000 22618.000000 19979.000000 22618.000000 22614.000000 22616.000000 22627.000000 22633.000000
mean 8.065715 10.874484 5.587861 7.691652 14.876219 13.375349 4.781559 8.923690
std 3.125250 4.377924 1.114614 3.109306 5.654342 5.768719 2.030128 3.517917
min 0.000000 0.000000 0.594562 0.000000 0.000000 0.000000 0.000000 0.000000
25% 6.944415 10.126646 4.887758 6.771102 13.464756 11.802936 4.018525 7.684016
50% 8.302613 11.785127 5.759630 8.265643 15.600454 15.222165 4.953729 9.484369
75% 10.135202 13.615299 6.386301 9.697896 17.931084 17.231611 5.862593 11.084557
max 14.869652 24.480271 12.719177 16.081632 29.459575 23.264738 16.001945 19.552149

Lead(Pb): Original Dataset¶

In [133]:
# Box plot - stages
pb[pb_stages].plot(kind='box', vert = False, figsize = (10,5), title = "Lead (Pb) Concentration by Stage", grid = True)

# Line plot - Median Stages
pb_median = pb[pb_stages].median()

plt.figure(figsize=(10,5))
plt.plot(stages, pb_median, marker = 'o', label = 'Pb median')
plt.title("Lead (Pb) Median Concentration by Stage")
plt.ylabel("Concentration")
plt.xticks(rotation = 45)
plt.grid(True)
plt.show()

# Show stats
pb[pb_stages].describe()
No description has been provided for this image
No description has been provided for this image
Out[133]:
rougher.input.feed_pb rougher.output.concentrate_pb rougher.output.tail_pb primary_cleaner.output.concentrate_pb primary_cleaner.output.tail_pb secondary_cleaner.output.tail_pb final.output.concentrate_pb final.output.tail_pb
count 22472.000000 22618.000000 22618.000000 22268.000000 22594.000000 22600.000000 22629.000000 22516.000000
mean 3.305676 6.900646 0.593620 8.921110 3.175822 5.304107 9.095308 2.488252
std 1.446905 2.806948 0.315295 3.706314 1.652177 3.092536 3.230797 1.189407
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2.658814 6.374692 0.427513 7.834690 2.269103 3.451435 8.750171 1.805376
50% 3.432054 7.572855 0.590746 9.921116 3.154022 5.074145 9.914519 2.653001
75% 4.204960 8.477358 0.763219 11.266100 4.123574 7.585281 10.929839 3.287790
max 7.142594 18.394042 3.778064 17.081278 9.634565 17.042505 17.031899 6.086532

Compare (Au, Ag, Pb): Original Dataset¶

In [134]:
# Box plot - stages
gold_full[all_metal_stages].plot(kind='box', figsize = (12,8), vert = False, title = "All Metals (Au, Ag, Pb) Concentration by Stage", grid = True)

# Line plot - Median Stages


au_cols = ['Rougher Feed (Au)','Rougher Conc(Au)','Rougher Tail(Au)',
          'Primary Cleaner Conc(Au)','Primary Cleaner Tail(Au)','Secondary Cleaner Tail(Au)','Final Conc(Au)','Final Tail(Au)']

ag_cols = ['Rougher Feed (Ag)','Rougher Conc(Ag)','Rougher Tail(Ag)',
          'Primary Cleaner Conc(Ag)','Primary Cleaner Tail(Ag)','Secondary Cleaner Tail(Ag)','Final Conc(Ag)','Final Tail(Ag)']

pb_cols = ['Rougher Feed (Pb)','Rougher Conc(Pb)','Rougher Tail(Pb)',
          'Primary Cleaner Conc(Pb)','Primary Cleaner Tail(Pb)','Secondary Cleaner Tail(Pb)','Final Conc(Pb)','Final Tail(Pb)']


plt.figure(figsize=(12,8))
plt.plot(au_cols, au_median, marker = 'o', label = 'Au median')
plt.plot(ag_cols, ag_median, marker = 'o', label = 'Ag median')
plt.plot(pb_cols, pb_median, marker = 'o', label = 'Pb median')
plt.title("All Metals (Ag, Au, Pb) Concentration by Stage")
plt.ylabel('Concentration')
plt.xticks(rotation = 90)
plt.grid(True)
plt.legend()
plt.show()



# Show stats
with pd.option_context('display.max_columns', None):
    display(gold_full[all_metal_stages].describe())
No description has been provided for this image
No description has been provided for this image
rougher.input.feed_au rougher.output.concentrate_au rougher.output.tail_au primary_cleaner.output.concentrate_au primary_cleaner.output.tail_au secondary_cleaner.output.tail_au final.output.concentrate_au final.output.tail_au rougher.input.feed_ag rougher.output.concentrate_ag rougher.output.tail_ag primary_cleaner.output.concentrate_ag primary_cleaner.output.tail_ag secondary_cleaner.output.tail_ag final.output.concentrate_ag final.output.tail_ag rougher.input.feed_pb rougher.output.concentrate_pb rougher.output.tail_pb primary_cleaner.output.concentrate_pb primary_cleaner.output.tail_pb secondary_cleaner.output.tail_pb final.output.concentrate_pb final.output.tail_pb
count 22617.000000 22618.000000 19980.000000 22618.000000 22617.000000 22618.000000 22630.000000 22635.000000 22618.000000 22618.000000 19979.000000 22618.000000 22614.000000 22616.000000 22627.000000 22633.000000 22472.000000 22618.000000 22618.000000 22268.000000 22594.000000 22600.000000 22629.000000 22516.000000
mean 7.565838 17.879538 1.821193 29.212289 3.670333 4.041218 40.001172 2.827459 8.065715 10.874484 5.587861 7.691652 14.876219 13.375349 4.781559 8.923690 3.305676 6.900646 0.593620 8.921110 3.175822 5.304107 9.095308 2.488252
std 3.026954 6.790112 0.695663 10.539303 1.985206 2.605738 13.398062 1.262834 3.125250 4.377924 1.114614 3.109306 5.654342 5.768719 2.030128 3.517917 1.446905 2.806948 0.315295 3.706314 1.652177 3.092536 3.230797 1.189407
min 0.000000 0.000000 0.020676 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.594562 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 6.485009 17.928729 1.403951 29.374406 2.741534 2.877554 42.383721 2.303108 6.944415 10.126646 4.887758 6.771102 13.464756 11.802936 4.018525 7.684016 2.658814 6.374692 0.427513 7.834690 2.269103 3.451435 8.750171 1.805376
50% 7.884832 20.003202 1.808567 32.359813 3.513008 3.956171 44.653436 2.913794 8.302613 11.785127 5.759630 8.265643 15.600454 15.222165 4.953729 9.484369 3.432054 7.572855 0.590746 9.921116 3.154022 5.074145 9.914519 2.653001
75% 9.668064 21.564238 2.215317 34.770726 4.559485 5.006944 46.111999 3.555077 10.135202 13.615299 6.386301 9.697896 17.931084 17.231611 5.862593 11.084557 4.204960 8.477358 0.763219 11.266100 4.123574 7.585281 10.929839 3.287790
max 14.093363 28.824507 9.688980 45.933934 18.528821 26.811643 53.611374 9.789625 14.869652 24.480271 12.719177 16.081632 29.459575 23.264738 16.001945 19.552149 7.142594 18.394042 3.778064 17.081278 9.634565 17.042505 17.031899 6.086532

Original Dataset: Outlier Analysis and Distribution Assessment

Analysis of the original dataset (pre-imputation) reveals significant outliers and wide variance across metal concentration measurements, particularly in gold (Au) processing stages. This assessment examines the raw data distribution before any imputation or outlier handling.


Distribution Summary by Metal Type

Gold (Au) Concentrations:

image.png

Gold (Au) - Extreme outliers and process anomalies:

  • Primary Cleaner Concentrate Au: Maximum 45.93 with high variance (±10.54), showing 41% above 75th percentile (34.77)
  • Secondary Cleaner Tail Au: Maximum 26.81, far exceeding expected near-depletion levels (median 3.96)
  • Final Concentrate Au: Range 0-53.61 represents massive spread; zero values indicate complete process failures
  • Rougher Tail Au: Maximum 9.69 is 5x the median (1.81), suggesting occasional poor separation
Stage Median Mean Std Dev Min Max Key Observation

| Rougher Feed | 7.88 | 7.57 | 3.03 | 0.00 | 14.09 | Baseline input | | Rougher Concentrate | 20.00 | 17.88 | 6.79 | 0.00 | 28.82 | 2.5x concentration from feed | | Rougher Tail | 1.81 | 1.82 | 0.70 | 0.02 | 9.69 | Low concentration, good separation | | Primary Cleaner Concentrate | 32.36 | 29.21 | 10.54 | 0.00 | 45.93 | Highest variance (±10.54) | | Primary Cleaner Tail | 3.51 | 3.67 | 1.99 | 0.00 | 18.53 | Moderate loss in tail | | Secondary Cleaner Tail | 3.96 | 4.04 | 2.61 | 0.00 | 26.81 | Higher variance than primary tail | | Final Concentrate | 44.65 | 40.00 | 13.40 | 0.00 | 53.61 | Maximum enrichment achieved (5.7x feed) | | Final Tail | 2.91 | 2.83 | 1.26 | 0.00 | 9.79 | Minimal loss, efficient recovery |

image.png

Silver (Ag) Concentrations:

image.png

Silver (Ag) - Inverted concentration patterns indicate data quality issues:

  • Primary Cleaner stages show inverted relationship: Tail (15.60) > Concentrate (8.27), opposite of expected behavior
  • Final Concentrate Ag (4.95) is LOWER than feed (8.30): Indicates silver rejection, not concentration
  • Final Tail Ag (9.48) exceeds final concentrate: Confirms poor silver recovery throughout process
  • Primary Cleaner Tail maximum (29.46): Extreme outlier nearly 2x the median (15.60)
Stage Median Mean Std Dev Min Max Key Observation
Rougher Feed 8.30 8.07 3.13 0.00 14.87 Baseline input
Rougher Concentrate 11.79 10.87 4.38 0.00 24.48 1.4x concentration from feed
Rougher Tail 5.76 5.59 1.11 0.59 12.72 Higher than Au tail (less efficient separation)
Primary Cleaner Concentrate 8.27 7.69 3.11 0.00 16.08 Lower than rougher concentrate (unusual)
Primary Cleaner Tail 15.60 14.88 6.54 0.00 29.46 Higher than concentrate (inverted pattern)
Secondary Cleaner Tail 15.22 13.38 5.77 0.00 23.26 Similar to primary cleaner tail
Final Concentrate 4.95 4.78 2.03 0.00 16.00 Low enrichment (~0.6x feed)
Final Tail 9.48 8.92 3.52 0.00 19.55 Higher than concentrate (poor recovery)

image.png

Lead (Pb) Concentrations:

image.png

Lead (Pb) - Moderate outliers with proper concentration trend:

  • Rougher Concentrate Pb: Maximum 18.39 is 2.4x the median (7.57), highest relative outlier
  • Secondary Cleaner Tail Pb: Maximum 17.04 is 3.4x the median (5.07), indicating occasional heavy losses
  • Overall pattern is correct: Concentrate > Tail at each stage, unlike silver
Stage Median Mean Std Dev Min Max Key Observation
Rougher Feed 3.43 3.31 1.45 0.00 7.14 Baseline input (lowest of 3 metals)
Rougher Concentrate 7.57 6.90 2.81 0.00 18.39 2.2x concentration from feed
Rougher Tail 0.59 0.59 0.32 0.00 3.78 Excellent separation (82% reduction)
Primary Cleaner Concentrate 9.92 8.92 3.71 0.00 17.08 Further enrichment to 3x feed
Primary Cleaner Tail 3.15 3.18 1.65 0.00 9.63 Moderate loss in tail
Secondary Cleaner Tail 5.07 5.30 3.09 0.00 17.04 Higher than primary tail
Final Concentrate 9.91 9.10 3.23 0.00 17.03 Maximum enrichment (2.9x feed)
Final Tail 2.65 2.49 1.19 0.00 6.09 Low loss, good overall recovery

image.png


Critical Outlier Observations

Zero values across all metals:

  • Present at minimum for nearly all stages (Au, Ag, Pb)
  • Particularly concerning in concentrate stages where zeros indicate complete process failure
  • May represent measurement errors, sensor failures, or true process shutdowns

Variance patterns:

  • Gold: Highest absolute variance (std dev 10.54-13.40) in cleaner concentrate stages
  • Silver: Primary cleaner tail shows highest relative variance (std dev 6.54, 44% of median 14.88)
  • Lead: Most stable relative to median, except secondary cleaner tail (std dev 3.09, 61% of median 5.07)

Recommendations

  • Investigate zero values for potential data or process errors.
  • Validate silver data — inverted trends likely indicate mislabeling or intentional rejection.
  • Apply robust scaling or log transforms to handle wide value ranges and outliers.
  • Flag abnormal tails (e.g., Au secondary cleaner tail) as potential inefficiencies.

Imputed Dataset¶

In [135]:
all_imp = gold_full_new2[['rougher.input.feed_au','rougher.input.feed_ag','rougher.input.feed_pb']]
In [136]:
# Box plot - stages
all_imp.plot(kind='box', vert = False,figsize=(10, 5), 
             title='All Metals (Au, Ag, Pb) Concentration for Rougher Feed: Imputed Dataset', grid = True)

gold_full[['rougher.input.feed_au','rougher.input.feed_ag','rougher.input.feed_pb']].plot(
    kind='box', vert = False, figsize = (10,5), 
    title = "All Metals (Au, Ag, Pb) Concentration for Rougher Feed: Original Dataset", grid = True)

# Show stats
all_imp.describe()
Out[136]:
rougher.input.feed_au rougher.input.feed_ag rougher.input.feed_pb
count 14336.000000 14336.000000 14336.000000
mean 8.014002 8.705637 3.571561
std 1.944873 1.984905 1.110681
min 0.010000 0.010000 0.010000
25% 6.668732 7.173804 2.802313
50% 7.767332 8.278239 3.467893
75% 9.266969 10.086486 4.300985
max 13.899559 14.596026 7.142594
No description has been provided for this image
No description has been provided for this image

Original Dataset Vs. Imputed Dataset

Metal Original Median Imputed Median Original Mean Imputed Mean Original Std Imputed Std Original Min Imputed Min Original Max Imputed Max Notes
Au (Gold) 7.88 7.77 7.57 8.01 3.03 1.94 0.00 0.01 14.09 13.90 Less spread and no zeros after imputation — good
Ag (Silver) 8.30 8.28 8.07 8.71 3.13 1.98 0.00 0.01 14.87 14.6 Same pattern — realistic floor now
Pb (Lead) 3.43 3.47 3.31 3.57 1.45 1.11 0.00 0.01 7.14 7.14 Minor tightening — variance reduced

Conclusion

The rougher feed concentrations of Au, Ag, and Pb were compared between the original and imputed datasets. The original data contained several zeros, likely due to missing sensor readings. After imputation, the minimum values increased to 0.01, standard deviations decreased slightly, and overall means remained consistent — indicating improved data integrity without distorting the underlying distributions. Outliers were retained, as they likely reflect genuine variations in ore composition rather than measurement errors.

Compare feed_size: Train vs. Test¶

In [137]:
# Compare the feed particle size distributions in the training set and in the test set. 
# If the distributions vary significantly, the model evaluation will be incorrect.

display(gold_train['rougher.input.feed_size'].describe())
display(gold_test['rougher.input.feed_size'].describe())

plt.figure(figsize=(10,5))
gold_train['rougher.input.feed_size'].plot(kind = 'kde', alpha = 0.7, label = 'Train_kde')
gold_test['rougher.input.feed_size'].plot(kind = 'kde', alpha = 0.7, label = 'Test_kde')
plt.legend()
plt.title("Feed Particle Size Distribution: Train vs Test (Original Dataset)")
plt.show()

plt.figure(figsize=(10,5))
gold_train_new2['rougher.input.feed_size'].plot(kind = 'kde', alpha = 0.7, label = 'Train_kde')
gold_test_new2['rougher.input.feed_size'].plot(kind = 'kde', alpha = 0.7, label = 'Test_kde')
plt.legend()
plt.title("Feed Particle Size Distribution: Train vs Test (Imputed Dataset)")
plt.show()

display(gold_train_new2['rougher.input.feed_size'].describe())
display(gold_test_new2['rougher.input.feed_size'].describe())
count    16443.000000
mean        58.676444
std         23.922591
min          9.659576
25%         47.575879
50%         54.104257
75%         65.051064
max        484.967466
Name: rougher.input.feed_size, dtype: float64
count    5834.000000
mean       55.937535
std        22.724254
min         0.046369
25%        43.890852
50%        50.002004
75%        61.638434
max       477.445473
Name: rougher.input.feed_size, dtype: float64
No description has been provided for this image
No description has been provided for this image
count    14336.000000
mean        59.916221
std         22.384221
min          9.659576
25%         48.873485
50%         55.253794
75%         65.994741
max        484.967466
Name: rougher.input.feed_size, dtype: float64
count    5856.000000
mean       55.934966
std        22.681559
min         0.046369
25%        43.898467
50%        50.109024
75%        61.608216
max       477.445473
Name: rougher.input.feed_size, dtype: float64

Feed_Size Distribution: Train Vs. Test

Dataset Train Mean Test Mean Train Median Test Median Std Diff Notes
Original 58.68 55.94 54.10 50.00 +1.2 More “natural,” small realistic difference
Imputed 59.92 55.93 55.25 50.11 +0.3 Slightly higher mean in train, but similar pattern

Conclusion

Both the original and imputed datasets were evaluated for feed size distribution consistency between the training and test sets. The original dataset shows slightly more natural variation, while imputation slightly increases the mean feed size in the training set due to smoothing of missing values. However, both datasets maintain comparable distribution shapes and ranges, confirming that train–test distributions are sufficiently aligned for model evaluation.

Total Concentration Sanity Check¶

In [138]:
# Consider the total concentrations of all substances at different stages: 
# raw feed, rougher concentrate, and final concentrate. Do you notice any abnormal values in the total distribution? 
# If you do, is it worth removing such values from both samples? Describe the findings and eliminate anomalies.

conc = gold_full[['rougher.input.feed_au', 'rougher.output.concentrate_au', 'final.output.concentrate_au','rougher.input.feed_ag', 'rougher.output.concentrate_ag', 
                  'final.output.concentrate_ag','rougher.input.feed_pb', 'rougher.output.concentrate_pb', 'final.output.concentrate_pb']].copy()

conc['total_concentration_feed'] = conc['rougher.input.feed_au'] + conc['rougher.input.feed_ag'] + conc['rougher.input.feed_pb']
conc['total_concentration_rougher'] = conc['rougher.output.concentrate_au'] + conc['rougher.output.concentrate_ag'] + conc['rougher.output.concentrate_pb']
conc['total_concentration_final'] = conc['final.output.concentrate_au'] + conc['final.output.concentrate_ag'] + conc['final.output.concentrate_pb']

conc_imp = gold_full_new2[['rougher.input.feed_au','rougher.input.feed_ag','rougher.input.feed_pb']].copy()
conc_imp['total_concentration_feed'] = conc_imp['rougher.input.feed_au'] + conc_imp['rougher.input.feed_ag'] + conc_imp['rougher.input.feed_pb']

display(conc)
display(conc_imp)


display(conc['total_concentration_feed'].describe())
display(conc['total_concentration_rougher'].describe())
display(conc['total_concentration_final'].describe())
print()
print("Original Dataset")
print("-----------------")
conc[['total_concentration_feed','total_concentration_rougher','total_concentration_final']].hist(figsize = (10,5), label = "Total Concentration: Original Dataset")
plt.show()
print()
print()
print("Imputed Dataset (total_concentration_feed)")
print("-------------------------------------------")
conc_imp['total_concentration_feed'].hist(figsize=(10,5))
plt.show()
display(conc_imp['total_concentration_feed'].describe())
rougher.input.feed_au rougher.output.concentrate_au final.output.concentrate_au rougher.input.feed_ag rougher.output.concentrate_ag final.output.concentrate_ag rougher.input.feed_pb rougher.output.concentrate_pb final.output.concentrate_pb total_concentration_feed total_concentration_rougher total_concentration_final
0 6.486150 19.793808 42.192020 6.100378 11.500771 6.055403 2.284912 7.101074 9.889648 14.871440 38.395653 58.137072
1 6.478583 20.050975 42.701629 6.161113 11.615865 6.029369 2.266033 7.278807 9.968944 14.905729 38.945647 58.699942
2 6.362222 19.737170 42.657501 6.116455 11.695753 6.055926 2.159622 7.216833 10.213995 14.638299 38.649756 58.927421
3 6.118189 19.320810 42.689819 6.043309 11.915047 6.047977 2.037807 7.175616 9.977019 14.199305 38.411473 58.714815
4 5.663707 19.216101 42.774141 6.060915 12.411054 6.148599 1.786875 7.240205 10.142511 13.511497 38.867359 59.065251
... ... ... ... ... ... ... ... ... ... ... ... ...
22711 5.335862 18.603550 46.713954 6.091855 11.124896 3.224920 4.617558 10.984003 11.356233 16.045275 40.712449 61.295107
22712 4.838619 18.441436 46.866780 6.121323 11.425983 3.195978 4.144989 10.888213 11.349355 15.104931 40.755632 61.412113
22713 4.525061 15.111231 46.795691 5.970515 8.523497 3.109998 4.020002 8.955135 11.434366 14.515579 32.589863 61.340054
22714 4.362781 17.834772 46.408188 6.048130 11.658799 3.367241 3.902537 10.655377 11.625587 14.313448 40.148948 61.401016
22715 4.365491 17.804134 46.299438 6.158718 11.959486 3.598375 3.875727 10.702148 11.737832 14.399936 40.465768 61.635645

22716 rows × 12 columns

rougher.input.feed_au rougher.input.feed_ag rougher.input.feed_pb total_concentration_feed
0 6.486150 6.100378 2.284912 14.871440
1 6.478583 6.161113 2.266033 14.905729
2 6.362222 6.116455 2.159622 14.638299
3 6.118189 6.043309 2.037807 14.199305
4 5.663707 6.060915 1.786875 13.511497
... ... ... ... ...
16855 5.335862 6.091855 4.617558 16.045275
16856 4.838619 6.121323 4.144989 15.104931
16857 4.525061 5.970515 4.020002 14.515579
16858 4.362781 6.048130 3.902537 14.313448
16859 4.365491 6.158718 3.875727 14.399936

14336 rows × 4 columns

count    22471.000000
mean        18.985914
std          7.300593
min          0.000000
25%         16.553308
50%         19.629877
75%         23.618515
max         35.071987
Name: total_concentration_feed, dtype: float64
count    22618.000000
mean        35.654668
std         13.224242
min          0.000000
25%         37.382512
50%         39.979226
75%         42.192901
max         55.568687
Name: total_concentration_rougher, dtype: float64
count    22627.000000
mean        53.881912
std         17.697706
min          0.000000
25%         58.706155
50%         60.081820
75%         60.993449
max         65.575259
Name: total_concentration_final, dtype: float64
Original Dataset
-----------------
No description has been provided for this image

Imputed Dataset (total_concentration_feed)
-------------------------------------------
No description has been provided for this image
count    14336.000000
mean        20.291200
std          4.591283
min          0.030000
25%         16.999213
50%         19.437493
75%         23.087114
max         34.830220
Name: total_concentration_feed, dtype: float64

Total Concentration Analysis

Original Dataset

Stage Count Mean Std Dev Min 25% 50% (Median) 75% Max
Feed 22,471 18.99 7.30 0.00 16.55 19.63 23.62 35.07
Rougher Concentrate 22,618 35.65 13.22 0.00 37.38 39.98 42.19 55.57
Final Concentrate 22,627 53.88 17.70 0.00 58.71 60.08 60.99 65.58

Observations:

  • The original dataset contains zero values in all stages, which may indicate missing or faulty sensor readings.
  • The total concentrations increase logically from feed → rougher → final, but extreme lows (zeros) are anomalous.

Imputed Dataset

Stage Count Mean Std Dev Min 25% 50% (Median) 75% Max
Feed 14,336 20.29 4.59 0.03 17.00 19.44 23.09 34.83

Observations:

  • After imputation, the feed stage has no zeros; the minimum increased to 0.03.
  • Standard deviation decreased slightly, indicating a tighter distribution.
  • The total feed concentration is more realistic and suitable for modeling without distorting the overall distribution.

Conclusion

Total concentrations at different processing stages were analyzed. The original dataset contained zero and extreme values, particularly in concentrate stages. For modeling, only rougher feed concentrations will be used. In the imputed dataset, zeros in rougher feed were replaced, reducing variance slightly, while preserving meaningful outliers. Other stages were not corrected, but anomalies were noted for process understanding.

Build the Model¶

In [139]:
# Separate date column by year and month
gold_full_new3 = gold_full_new2.copy()
gold_full_new3['year'] = gold_full_new3['date'].dt.year
gold_full_new3['month'] = gold_full_new3['date'].dt.month


year_df = gold_full_new3[gold_full_new3['year'] == 2016]
year_df1 = gold_full_new3[gold_full_new3['year'] == 2017]
year_df2 = gold_full_new3[gold_full_new3['year'] == 2018]

year_df.groupby('month')['final.output.recovery'].mean().plot(kind='line', grid = True, label = '2016')
year_df1.groupby('month')['final.output.recovery'].mean().plot(kind='line', grid = True, label = '2017')
year_df2.groupby('month')['final.output.recovery'].mean().plot(kind='line', grid = True, label = '2018')
plt.title('Average Recovery by Month')
plt.ylabel('final.output.recovery')
plt.legend()
plt.show()

gold_full_new3 = gold_full_new3.drop(columns = ['date'])


# Do the same for the training set and the test set
gold_train_new3 = gold_train_new2.copy()
gold_train_new3['year'] = gold_train_new3['date'].dt.year
gold_train_new3['month'] = gold_train_new3['date'].dt.month
gold_train_new3 = gold_train_new3.drop(columns = ['date'])

gold_test_new3 = gold_test_new2.copy()
gold_test_new3['year'] = gold_test_new3['date'].dt.year
gold_test_new3['month'] = gold_test_new3['date'].dt.month
gold_test_new3 = gold_test_new3.drop(columns = ['date'])
No description has been provided for this image
In [140]:
# Create a temporary DF with rougher.output.recovery to do the sMAPE calculation
train_w_rougher = gold_train_new3.merge(
    gold_train[['rougher.output.recovery']],
    left_index=True,
    right_index=True,
    how='left'
)
train_w_rougher = train_w_rougher.dropna(subset=['rougher.output.recovery'])

train_w_rougher.isna().sum()

train_w_rougher
Out[140]:
final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level primary_cleaner.state.floatbank8_c_air ... secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level year month rougher.output.recovery
0 70.541216 127.092003 10.128295 7.25 0.988759 1549.775757 -498.912140 1551.434204 -516.403442 1549.873901 ... -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980 2016 1 87.107763
1 69.266198 125.629232 10.296251 7.25 1.002663 1576.166671 -500.904965 1575.950626 -499.865889 1575.994189 ... -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184 2016 1 86.843261
2 68.116445 123.819808 11.316280 7.25 0.991265 1601.556163 -499.997791 1600.386685 -500.607762 1602.003542 ... -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363 2016 1 86.842308
3 68.347543 122.270188 11.322140 7.25 0.996739 1599.968720 -500.951778 1600.659236 -499.677094 1600.304144 ... -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129 2016 1 87.226430
4 66.927016 117.988169 11.913613 7.25 1.009869 1601.339707 -498.975456 1601.437854 -500.323246 1599.581894 ... -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691 2016 1 86.688794
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16855 73.755150 123.381787 8.028927 6.50 1.304232 1648.421193 -400.382169 1648.742005 -400.359661 1648.578230 ... -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428 2018 8 89.574376
16856 69.049291 120.878188 7.962636 6.50 1.302419 1649.820162 -399.930973 1649.357538 -399.721222 1648.656192 ... -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608 2018 8 87.724007
16857 67.002189 105.666118 7.955111 6.50 1.315926 1649.166761 -399.888631 1649.196904 -399.677571 1647.896999 ... -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452 2018 8 88.890579
16858 65.523246 98.880538 7.984164 6.50 1.241969 1646.547763 -398.977083 1648.212240 -400.383265 1648.917387 ... -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471 2018 8 89.858126
16859 70.281454 95.248427 8.078957 6.50 1.283045 1648.759906 -399.862053 1650.135395 -399.957321 1648.831890 ... -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575 2018 8 89.514960

13582 rows × 56 columns

symmetric Mean Absolute Percentage Error¶

In [141]:
# Write a function to calculate the final sMAPE value.

def final_smape(df_train_final, final_column, df_train_rougher, rougher_column, df_test, 
                model_final = LinearRegression(), model_rougher = LinearRegression()):

    # Prepare features and targets
    features_train_final = df_train_final.drop([final_column], axis = 1)
    target_train_final = df_train_final[final_column]

    features_train_rougher = df_train_rougher.drop([rougher_column, final_column], axis = 1)
    target_train_rougher = df_train_rougher[rougher_column]
    
    # Fit models
    model_final.fit(features_train_final,target_train_final)
    model_rougher.fit(features_train_rougher,target_train_rougher)

    
    # Predict
    predicted_test_final = model_final.predict(df_test)
    predicted_test_rougher = model_rougher.predict(df_test)

    # Rename for formula
    y_true_final = target_train_final
    y_true_rougher = target_train_rougher

    y_pred_final = predicted_test_final
    y_pred_rougher = predicted_test_rougher

    # Smape Calculation
    smape_final = np.mean(((np.abs(y_true_final[:len(y_pred_final)] - y_pred_final)) / 
                           ((np.abs(y_true_final[:len(y_pred_final)]) + np.abs(y_pred_final)) / 2))) * 100
    smape_rougher = np.mean(((np.abs(y_true_rougher[:len(y_pred_rougher)] - y_pred_rougher)) / 
                             ((np.abs(y_true_rougher[:len(y_pred_rougher)]) + np.abs(y_pred_rougher)) / 2))) * 100


    # Final Smape
    return 0.25 * smape_rougher + 0.75 * smape_final
In [142]:
# Get the final_smape

# Use Linear Regression (default)
print("Linear Regression Model: Final sMAPE")
print("-------------------------------------")
display(final_smape(gold_train_new3,'final.output.recovery', train_w_rougher, 'rougher.output.recovery', 
            gold_test_new3, model_final = LinearRegression(), model_rougher = LinearRegression()))
print()
print()
# Use DecisionTreeRegressor
print("Decision Tree Model: Final sMAPE")
print("-------------------------------------")
display(final_smape(gold_train_new3,'final.output.recovery', train_w_rougher, 'rougher.output.recovery', 
            gold_test_new3, model_final = DecisionTreeRegressor(random_state=1234), model_rougher = DecisionTreeRegressor(random_state=12345)))
print()
print()
# Use RandomForestRegressor
print("Random Forest Model: Final sMAPE")
print("-------------------------------------")
display(final_smape(gold_train_new3,'final.output.recovery', train_w_rougher, 'rougher.output.recovery', 
            gold_test_new3, model_final = RandomForestRegressor(random_state=1234), model_rougher = RandomForestRegressor(random_state=1234)))
Linear Regression Model: Final sMAPE
-------------------------------------
14.63521422380902

Decision Tree Model: Final sMAPE
-------------------------------------
22.48113187928036

Random Forest Model: Final sMAPE
-------------------------------------
14.298332287669346
Model Final sMAPE (%) Interpretation
Linear Regression 14.64 Achieved strong predictive accuracy, indicating relatively low average percentage error between predicted and actual recovery values.
Decision Tree 22.48 Performed noticeably worse than other models, suggesting higher variance or overfitting to the training data.
Random Forest 14.30 Produced the lowest sMAPE, close to Linear Regression, indicating robust and reliable performance.

Summary

Among the three models tested, Random Forest achieved the lowest sMAPE (14.30%), narrowly outperforming Linear Regression (14.64%), while the Decision Tree showed significantly higher error (22.48%).

This suggests that both Random Forest and Linear Regression are effective models for predicting gold recovery, with Random Forest offering slightly better accuracy and likely better generalization to unseen data. The Decision Tree model, while simpler, likely overfits and does not generalize as well to the test data.

Evaluating Models Using Cross-Validation¶

In [143]:
# Train different models. Evaluate them using cross-validation. Pick the best model and test it using the test sample.

# Define features and target
features_train_final = gold_train_new3.drop(['final.output.recovery'], axis=1)
target_train_final = gold_train_new3['final.output.recovery']

# Define models
model_rf = RandomForestRegressor(random_state=1234)
model_lr = LinearRegression()

# Perform 5-fold cross-validation
print("Random Forest Model: Cross-validating...")
scores_rf = cross_val_score(model_rf, features_train_final, target_train_final, cv=5, scoring='neg_mean_absolute_error')
print("Linear Regression Model: Cross-validating...")
scores_lr = cross_val_score(model_lr, features_train_final, target_train_final, cv=5, scoring='neg_mean_absolute_error')

# Convert from negative (since sklearn uses neg metrics for errors)
final_score_rf = -np.mean(scores_rf)
final_score_lr = -np.mean(scores_lr)


print('Average Cross-Validation Score (Random Forest):', final_score_rf)
print('Average Cross-Validation Score (Linear Regression):', final_score_lr)
Random Forest Model: Cross-validating...
Linear Regression Model: Cross-validating...
Average Cross-Validation Score (Random Forest): 6.262558587624503
Average Cross-Validation Score (Linear Regression): 6.8606775930853985

Model Evaluation Summary

Cross-Validation Results (5-Fold MAE)
Model Average Cross-Validation MAE
Random Forest 6.26
Linear Regression 6.86
Decision Tree N/A (not tested / worse)

Summary

The predictive performance of different models was evaluated using both sMAPE and 5-fold cross-validation with mean absolute error.

Random Forest: Produced the lowest sMAPE (14.30%) and the lowest average cross-validation MAE (6.26), indicating strong predictive accuracy and robustness.

Linear Regression: Achieved similar performance, with slightly higher sMAPE (14.64%) and MAE (6.86), suggesting it is also a reliable model, though marginally less accurate than Random Forest.

Decision Tree: Performed noticeably worse, with higher sMAPE (22.48%), indicating higher variance and overfitting risk.

Overall, Random Forest is the top-performing model, though Linear Regression remains a strong alternative. Decision Tree is not recommended due to its comparatively poorer accuracy.

In [144]:
# Since rougher and final outputs were never in the test set, make DF's for the test set that align with the true targets
gold_full['date'] = pd.to_datetime(gold_full['date'])

# Create a test targets DF for sMAPE and predictions
test_dates = gold_test_new2['date']

test_w_targets = gold_full[gold_full['date'].isin(test_dates)][
    ['final.output.recovery', 'rougher.output.recovery']
].reset_index(drop=True)


# Merge targets with test features
gold_test_w_targets = pd.concat([gold_test_new3.reset_index(drop=True), test_w_targets], axis=1)

# Drop NaN rows for the missing targets
gold_test_w_true_targets_all = gold_test_w_targets.dropna()

# Get the true targets only
gold_test_w_true_targets_only = gold_test_w_true_targets_all[['final.output.recovery','rougher.output.recovery']]

# Get the true features only
gold_test_w_true_features = gold_test_w_true_targets_all.drop(columns = ['final.output.recovery','rougher.output.recovery'])

display(gold_test_w_true_features)
gold_test_new3
primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level primary_cleaner.state.floatbank8_c_air primary_cleaner.state.floatbank8_c_level ... secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level year month
0 210.800909 14.993118 8.080000 1.005021 1398.981301 -500.225577 1399.144926 -499.919735 1400.102998 -500.704369 ... 8.016656 -501.289139 7.946562 -432.317850 4.872511 -500.037437 26.705889 -499.709414 2016 9
1 215.392455 14.987471 8.080000 0.990469 1398.777912 -500.057435 1398.055362 -499.778182 1396.151033 -499.240168 ... 8.130979 -499.634209 7.958270 -525.839648 4.878850 -500.162375 25.019940 -499.819438 2016 9
2 215.259946 12.884934 7.786667 0.996043 1398.493666 -500.868360 1398.860436 -499.764529 1398.075709 -502.151509 ... 8.096893 -500.827423 8.071056 -500.801673 4.905125 -499.828510 24.994862 -500.622559 2016 9
3 215.336236 12.006805 7.640000 0.863514 1399.618111 -498.863574 1397.440120 -499.211024 1400.129303 -498.355873 ... 8.074946 -499.474407 7.897085 -500.868509 4.931400 -499.963623 24.948919 -498.709987 2016 9
4 199.099327 10.682530 7.530000 0.805575 1401.268123 -500.808305 1398.128818 -499.504543 1402.172226 -500.810606 ... 8.054678 -500.397500 8.107890 -509.526725 4.957674 -500.360026 25.003331 -500.856333 2016 9
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5851 173.957757 15.963399 8.070000 0.896701 1401.930554 -499.728848 1401.441445 -499.193423 1399.810313 -499.599127 ... 12.069155 -499.673279 7.977259 -499.516126 5.933319 -499.965973 8.987171 -499.755909 2017 12
5852 172.910270 16.002605 8.070000 0.896519 1447.075722 -494.716823 1448.851892 -465.963026 1443.890424 -503.587739 ... 13.365371 -499.122723 9.288553 -496.892967 7.372897 -499.942956 8.986832 -499.903761 2017 12
5853 171.135718 15.993669 8.070000 1.165996 1498.836182 -501.770403 1499.572353 -495.516347 1502.749213 -520.667442 ... 15.101425 -499.936252 10.989181 -498.347898 9.020944 -500.040448 8.982038 -497.789882 2017 12
5854 179.697158 15.438979 8.070000 1.501068 1498.466243 -500.483984 1497.986986 -519.200340 1496.569047 -487.479567 ... 15.026853 -499.723143 11.011607 -499.985046 9.009783 -499.937902 9.012660 -500.154284 2017 12
5855 181.556856 14.995850 8.070000 1.623454 1498.096303 -499.796922 1501.743791 -505.146931 1499.535978 -492.428226 ... 14.914199 -499.948518 10.986607 -500.658027 8.989497 -500.337588 8.988632 -500.764937 2017 12

5290 rows × 54 columns

Out[144]:
primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level primary_cleaner.state.floatbank8_c_air primary_cleaner.state.floatbank8_c_level ... secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level year month
0 210.800909 14.993118 8.080000 1.005021 1398.981301 -500.225577 1399.144926 -499.919735 1400.102998 -500.704369 ... 8.016656 -501.289139 7.946562 -432.317850 4.872511 -500.037437 26.705889 -499.709414 2016 9
1 215.392455 14.987471 8.080000 0.990469 1398.777912 -500.057435 1398.055362 -499.778182 1396.151033 -499.240168 ... 8.130979 -499.634209 7.958270 -525.839648 4.878850 -500.162375 25.019940 -499.819438 2016 9
2 215.259946 12.884934 7.786667 0.996043 1398.493666 -500.868360 1398.860436 -499.764529 1398.075709 -502.151509 ... 8.096893 -500.827423 8.071056 -500.801673 4.905125 -499.828510 24.994862 -500.622559 2016 9
3 215.336236 12.006805 7.640000 0.863514 1399.618111 -498.863574 1397.440120 -499.211024 1400.129303 -498.355873 ... 8.074946 -499.474407 7.897085 -500.868509 4.931400 -499.963623 24.948919 -498.709987 2016 9
4 199.099327 10.682530 7.530000 0.805575 1401.268123 -500.808305 1398.128818 -499.504543 1402.172226 -500.810606 ... 8.054678 -500.397500 8.107890 -509.526725 4.957674 -500.360026 25.003331 -500.856333 2016 9
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5851 173.957757 15.963399 8.070000 0.896701 1401.930554 -499.728848 1401.441445 -499.193423 1399.810313 -499.599127 ... 12.069155 -499.673279 7.977259 -499.516126 5.933319 -499.965973 8.987171 -499.755909 2017 12
5852 172.910270 16.002605 8.070000 0.896519 1447.075722 -494.716823 1448.851892 -465.963026 1443.890424 -503.587739 ... 13.365371 -499.122723 9.288553 -496.892967 7.372897 -499.942956 8.986832 -499.903761 2017 12
5853 171.135718 15.993669 8.070000 1.165996 1498.836182 -501.770403 1499.572353 -495.516347 1502.749213 -520.667442 ... 15.101425 -499.936252 10.989181 -498.347898 9.020944 -500.040448 8.982038 -497.789882 2017 12
5854 179.697158 15.438979 8.070000 1.501068 1498.466243 -500.483984 1497.986986 -519.200340 1496.569047 -487.479567 ... 15.026853 -499.723143 11.011607 -499.985046 9.009783 -499.937902 9.012660 -500.154284 2017 12
5855 181.556856 14.995850 8.070000 1.623454 1498.096303 -499.796922 1501.743791 -505.146931 1499.535978 -492.428226 ... 14.914199 -499.948518 10.986607 -500.658027 8.989497 -500.337588 8.988632 -500.764937 2017 12

5856 rows × 54 columns

In [145]:
# Separate the targets from train_w_rougher; so both targets have the same number of predictions

true_train_targets = train_w_rougher[['final.output.recovery','rougher.output.recovery']]
true_train_targets_rougher = true_train_targets['rougher.output.recovery']
true_train_targets_rougher_full = train_w_rougher.drop(columns = ['final.output.recovery'])
true_train_targets_final = true_train_targets['final.output.recovery']
true_train_targets_final_full = train_w_rougher.drop(columns = ['rougher.output.recovery'])
true_train_features = train_w_rougher.drop(columns = ['final.output.recovery','rougher.output.recovery'])
true_train_targets_final_full
Out[145]:
final.output.recovery primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level primary_cleaner.state.floatbank8_c_air ... secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level year month
0 70.541216 127.092003 10.128295 7.25 0.988759 1549.775757 -498.912140 1551.434204 -516.403442 1549.873901 ... 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980 2016 1
1 69.266198 125.629232 10.296251 7.25 1.002663 1576.166671 -500.904965 1575.950626 -499.865889 1575.994189 ... 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184 2016 1
2 68.116445 123.819808 11.316280 7.25 0.991265 1601.556163 -499.997791 1600.386685 -500.607762 1602.003542 ... 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363 2016 1
3 68.347543 122.270188 11.322140 7.25 0.996739 1599.968720 -500.951778 1600.659236 -499.677094 1600.304144 ... 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129 2016 1
4 66.927016 117.988169 11.913613 7.25 1.009869 1601.339707 -498.975456 1601.437854 -500.323246 1599.581894 ... 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691 2016 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16855 73.755150 123.381787 8.028927 6.50 1.304232 1648.421193 -400.382169 1648.742005 -400.359661 1648.578230 ... 20.007571 -499.740028 18.006038 -499.834374 13.001114 -500.155694 20.007840 -501.296428 2018 8
16856 69.049291 120.878188 7.962636 6.50 1.302419 1649.820162 -399.930973 1649.357538 -399.721222 1648.656192 ... 20.035660 -500.251357 17.998535 -500.395178 12.954048 -499.895163 19.968498 -501.041608 2018 8
16857 67.002189 105.666118 7.955111 6.50 1.315926 1649.166761 -399.888631 1649.196904 -399.677571 1647.896999 ... 19.951231 -499.857027 18.019543 -500.451156 13.023431 -499.914391 19.990885 -501.518452 2018 8
16858 65.523246 98.880538 7.984164 6.50 1.241969 1646.547763 -398.977083 1648.212240 -400.383265 1648.917387 ... 20.054122 -500.314711 17.979515 -499.272871 12.992404 -499.976268 20.013986 -500.625471 2018 8
16859 70.281454 95.248427 8.078957 6.50 1.283045 1648.759906 -399.862053 1650.135395 -399.957321 1648.831890 ... 20.020205 -500.220296 17.963512 -499.939490 12.990306 -500.080993 19.990336 -499.191575 2018 8

13582 rows × 55 columns

In [146]:
# Test the best model (RandomForest) and test it using the true test sample

# Get prediction for final output
best_model = model_rf  
best_model.fit(true_train_features, true_train_targets_final)

predictions_test_final = best_model.predict(gold_test_w_true_features)

print("Model Predictions for final.output.recovery (first 10):", predictions_test_final[:10])
Model Predictions for final.output.recovery (first 10): [67.81991271 68.35690678 68.14608863 67.68091849 69.04425219 68.95453817
 66.82514134 65.0654562  65.68501128 65.81068977]
In [147]:
# Get prediction for rougher output
best_model = model_rf  
best_model.fit(true_train_features, true_train_targets_rougher)

predictions_test_rougher = best_model.predict(gold_test_w_true_features)

print("Model Predictions for rougher.output.recovery (first 10):", predictions_test_rougher[:10])
Model Predictions for rougher.output.recovery (first 10): [89.32562713 85.24048458 86.04925568 84.58411688 88.13906292 86.41778105
 74.77973144 73.4266545  69.30563899 75.36474826]
In [148]:
# Call your original final_smape function for aligned features
print("Random Forest Model: Final sMAPE (with aligned features)")
print("-------------------------------------")
display(final_smape(true_train_targets_final_full,
                    'final.output.recovery',
                    train_w_rougher,
                    'rougher.output.recovery',
                    gold_test_w_true_features,
                    model_final=RandomForestRegressor(random_state = 1234),
                    model_rougher=RandomForestRegressor(random_state=1234)
))
Random Forest Model: Final sMAPE (with aligned features)
-------------------------------------
12.278200328481306
In [149]:
# Quick hypertuning for best results for final output
param_grid = {'max_depth': [3, 5, 7], 'n_estimators': [50, 100]}
grid = GridSearchCV(RandomForestRegressor(random_state=1234), param_grid, cv=5, scoring='neg_mean_absolute_error')
grid.fit(true_train_features, true_train_targets_final)

print("Best parameters (final output):", grid.best_params_)
print("Best score (final output):", -grid.best_score_)
Best parameters (final output): {'max_depth': 5, 'n_estimators': 100}
Best score (final output): 5.506685259260206
In [150]:
# Quick hypertuning for best results for rougher output
param_grid = {'max_depth': [3, 5, 7], 'n_estimators': [50, 100]}
grid = GridSearchCV(RandomForestRegressor(random_state=1234), param_grid, cv=5, scoring='neg_mean_absolute_error')
grid.fit(true_train_features, true_train_targets_rougher)

print("Best parameters (rougher output):", grid.best_params_)
print("Best score (rougher output):", -grid.best_score_)
Best parameters (rougher output): {'max_depth': 5, 'n_estimators': 100}
Best score (rougher output): 6.775727004776654
In [151]:
# Call your original final_smape function w/ hypertuned model
print("Random Forest Model: Final sMAPE (with hypertuned model)")
print("-------------------------------------")
display(final_smape(true_train_targets_final_full,
                    'final.output.recovery',
                    train_w_rougher,
                    'rougher.output.recovery',
                    gold_test_w_true_features,
                    model_final=RandomForestRegressor(random_state = 1234, max_depth = 5, n_estimators = 100),
                    model_rougher=RandomForestRegressor(random_state=1234, max_depth = 5, n_estimators = 100)
))
Random Forest Model: Final sMAPE (with hypertuned model)
-------------------------------------
11.477138030701516

Final Model Evaluation: Random Forest on True Test Set

After initial model comparison, the Random Forest model was selected for final evaluation on the complete test set with true target values. Hyperparameter tuning was applied to optimize performance.


Model Performance Progression

Stage Model Configuration Final sMAPE (%) Rougher MAE Final MAE Key Improvement
Initial Baseline Default Random Forest 14.30% 6.26 6.86 Cross-validation results
True Test Set Default Random Forest 12.28% - - -2.02pp improvement on real data
Hypertuned Model max_depth=5, n_estimators=100 11.48% 6.78 5.51 -0.80pp final improvement

Hyperparameter Tuning Results

Optimal parameters identified through GridSearchCV (5-fold):

Target Variable Best max_depth Best n_estimators Cross-Validation MAE
Final Output Recovery 5 100 5.51
Rougher Output Recovery 5 100 6.78

Both recovery predictions benefited from the same hyperparameter configuration, suggesting consistent optimal model complexity across targets.


Key Findings

Performance improvement through optimization:

  • Initial model (14.30% sMAPE) → True test set (12.28%) → Hypertuned (11.48%)
  • Total improvement: 2.82 percentage points (19.7% relative reduction in error)
  • Hypertuning alone contributed 0.80pp improvement beyond model selection

Model characteristics:

  • Optimal depth (5): Prevents overfitting while capturing complex flotation relationships
  • Optimal estimators (100): Balances ensemble strength with computational efficiency
  • Consistent parameters: Same configuration optimal for both rougher and final recovery predictions

Prediction patterns:

  • Final recovery predictions are more stable (±3% range in sample)
  • Rougher recovery predictions show higher variance (±20% range in sample)
  • This reflects real process behavior: rougher stage is more sensitive to ore variability

Conclusion

The hypertuned Random Forest model achieves a final sMAPE of 11.48% on the true test set, representing strong predictive accuracy for gold recovery optimization. The model successfully balances complexity (max_depth=5) with ensemble power (n_estimators=100), avoiding overfitting while maintaining robust performance on unseen data.

With an average prediction error of approximately 11.5%, the model provides reliable forecasts for both rougher and final stage gold recovery, enabling process optimization decisions with quantifiable confidence intervals.

Final Conclusion¶

This project developed a predictive model for gold recovery in a flotation enrichment process, achieving a final sMAPE of 11.48% through comprehensive data preprocessing, feature engineering, and hyperparameter optimization.


Data Preprocessing Summary¶

Missing Data Strategy:

Category Features Approach Coverage
High correlation (sulfate) floatbank10/11_sulfate Conditional imputation based on concentration ranges 99.2% pattern match
Near-perfect match (xanthate) floatbank10/11_xanthate Direct copy (difference ≈ 0) ~0 median difference
Strong correlation (air) floatbank10_e/f_air Copy within normal range (844-856) 99.5% variance reduction
Range-specific (feed_sol) feed_sol Median/offset by feed_size range 9 range-specific strategies
Minimal missing (<1%) 43 remaining features Training set median Negligible impact

Key preprocessing achievements:

  • Removed 34 output/calculation features not available at prediction time
  • Eliminated 9.06% of training rows with missing target values
  • Applied sophisticated conditional imputation preserving process relationships
  • Validated formula accuracy: rougher recovery MAE = 9.3e-15 (essentially perfect)

Outlier Analysis: Original vs. Imputed Data¶

Metal Concentration Patterns:

Metal Original Min Imputed Min Std Dev Change Key Finding
Gold (Au) 0.00 0.01 -36% (3.03 → 1.94) Zeros eliminated, variance reduced
Silver (Ag) 0.00 0.01 -37% (3.13 → 1.98) Improved data quality
Lead (Pb) 0.00 0.01 -23% (1.45 → 1.11) Tighter distribution

Critical findings from original data:

  • Gold: Proper concentration trend (5.7x enrichment), but extreme outliers up to 53.61 in final concentrate
  • Silver: Inverted pattern (final tail > final concentrate) indicates rejection, not recovery - not suitable as target
  • Lead: Correct concentration progression (2.9x enrichment) with moderate outliers
  • Zero values: Present across all stages in original data - eliminated through imputation as measurement errors

Total concentration validation:

Stage Dataset Mean Std Dev Min Max Assessment
Feed Original 18.99 7.30 0.00 35.07 Zeros indicate measurement gaps
Feed Imputed 20.29 4.59 0.03 34.83 More reliable, tighter distribution
Rougher Concentrate Original 35.65 13.22 0.00 55.57 High variance with zeros
Final Concentrate Original 53.88 17.70 0.00 65.58 Extreme spread

Distribution Consistency: Train-Test Alignment¶

Feed Size Distribution:

Dataset Train Mean Test Mean Train Median Test Median Std Dev Difference
Original 58.68 55.94 54.10 50.00 +1.2
Imputed 59.92 55.93 55.25 50.11 +0.3

Validation: Train-test distributions remain aligned after imputation, with slightly improved consistency. No significant distribution shift detected - model evaluation will be unbiased.


Model Performance Evolution¶

Initial Model Comparison (Cross-Validation):

Model Final sMAPE (%) Cross-Val MAE Interpretation
Random Forest 14.30% 6.26 Best performance, robust
Linear Regression 14.64% 6.86 Strong alternative
Decision Tree 22.48% N/A Overfitting, not recommended

Final Model Optimization:

Stage Configuration Final sMAPE (%) Final MAE Rougher MAE Improvement
Baseline Default RF 14.30% 6.86 6.26 Initial
True Test Default RF 12.28% - - -2.02pp
Hypertuned depth=5, n=100 11.48% 5.51 6.78 -2.82pp total

Hyperparameter optimization results:

Target Variable Optimal Depth Optimal Estimators CV MAE
Final Output Recovery 5 100 5.51
Rougher Output Recovery 5 100 6.78

Key Technical Achievements¶

  1. Data Quality Enhancement:

    • Sophisticated conditional imputation based on process relationships (not simple statistical fills)
    • Eliminated measurement errors (zeros) while preserving valid extreme values
    • Maintained train-test distribution alignment throughout preprocessing
  2. Process Understanding:

    • Validated flotation chemistry relationships (xanthate equilibrium, sulfate patterns)
    • Identified silver rejection pattern (not a recovery target)
    • Established normal operating ranges for air and chemical dosing
  3. Model Development:

    • Random Forest selected over simpler models (Decision Tree overfits, Linear Regression comparable but less robust)
    • Hyperparameter tuning achieved 19.7% relative error reduction from baseline
    • Final model balances complexity (depth=5) with ensemble strength (100 trees)
  4. Prediction Reliability:

    • Final sMAPE of 11.48% means predictions within ~11.5% of actual values
    • More stable predictions for final recovery (±3% range) than rougher recovery (±20% range)
    • Reflects real process: rougher stage more sensitive to ore variability

Business Impact¶

The optimized Random Forest model provides reliable gold recovery predictions with approximately 11.5% average error, enabling:

  • Process optimization: Predict recovery rates before processing full batches
  • Quality control: Identify conditions likely to produce suboptimal recovery
  • Cost reduction: Adjust chemical dosing and operating parameters proactively
  • Decision support: Quantifiable confidence intervals for production planning

Model deployment readiness: With comprehensive preprocessing pipelines validated on test data and hyperparameters optimized through cross-validation, the model is production-ready for integration into flotation process control systems.


Recommendations for Deployment¶

  1. Monitor data quality: Continue investigating zero values and extreme outliers in real-time sensor data
  2. Retrain periodically: Update model as ore characteristics or process conditions change
  3. Feature importance analysis: Identify which sensors/parameters most influence recovery for targeted monitoring
  4. Silver processing: If silver recovery becomes economically important, investigate why current process rejects it
  5. Ensemble expansion: Consider adding gradient boosting models (XGBoost, LightGBM) for potential further improvements

The project successfully demonstrates that machine learning can accurately predict gold recovery in complex flotation processes, providing actionable insights for operational optimization.



Condensed Conculsions per Section¶

MAE Calculations for Rougher Recovery Calculation (Training Set)¶

Category Metric / Finding Result Interpretation
Formula Check Mean Absolute Error (MAE) 9.3e-15 ~0% → Perfect formula-target alignment
Comparison Quality Values differing by > 1e-20 0 All differences are only floating-point precision
Data Coverage Total rows in training set 16,860 Full dataset size
Valid formula results 14,287 (84.7%) Rows usable for comparison
NaN in formula results 2,283 (13.5%) Missing required input columns
NaN in target values 2,573 (15.3%) Expected missing measurements
Reliability Formula Accuracy ✅ Perfect Matches target column exactly
Data Integrity ⚠️ Good Some NaNs, but typical in industrial datasets
Risk of Error ✅ Low No meaningful calculation errors detected

Conclusion:

The MAE calculation for the training dataset using the Recovery formula for the predictions and the training datasets rougher.output.recovery column was 9.3e-15 meaning the difference between the predicted and actual values was negligable. Thus, the recovery formula is fully validated. It reproduces the target values with perfect precision (MAE ≈ 0) on all usable rows. While ~15% of rows contain missing data, this is expected and does not affect formula correctness. The dataset is reliable for model development, with 84.7% coverage available for training and validation.

Missing Test Features¶

Summary by Parameter Type (Test Set):

Parameter Type Count Reason for Exclusion
Output 28 Features only known after processing, not available at prediction time
Target 2 Used for stage-specific predictions
Calculations 4 Dependent on outputs/targets → potential data leakage

Note: All features are of Float data type.

Missing Values Overview¶

Summary Statistics

Dataset Total Features Features with Missing Data Complete Features
Full Dataset 87 85 2
Training Set 87 85 2
Test Set 53 51 2

Dataset Comparison Summary

Missing bins All Full Train Test Full + Train Full + Test
< 1% 40 1 1 6 21 –
1 – < 5% 2 1 1 3 10 –
5 – < 10% 1 1 – – 5 1
10 – < 15% – 1 2 – 3 –
≥ 15% – – – 1 – –
Total 43 4 5 9 39 1
Missing Data Level Full Dataset Training Dataset Test Dataset
< 1% 62 (71.26%) features 62 (71.26%) features 45 (84.91%) features
1% - < 5% 14 (16.10%) features 14 (16.10%) features (9.43%) 5 features
5% - < 10% 7 (8.05%) features 5 (5.75%) features 2 (3.77%) features
10% - < 15% 4 (4.60%) features 5 (5.75%) features 0 (0%) features
≥ 15% 0 (0%) features 1 (1.15%) feature 0 (0%) features
Total with Missing Data 87 features 87 features 53 features

Conclusion

The Test dataset is intentionally cleaner, excluding 34 output/target/calculation features that drive most missing values in the Full and Training sets. Across shared input features, data quality is generally consistent, with the Test set slightly cleaner.

Most features have negligible missingness (<1%), while moderate gaps (1–10%) are manageable with imputation. Severe missingness (≥10%) is confined to output and calculation features, confirming that predictors remain reliable. This design ensures the Test set is suitable for unbiased model evaluation, while preprocessing should focus on imputing minor gaps in predictors and excluding heavily missing output features.

Missing Data Summary (Target NaN's Removed) - Side-by-Side Comparison¶

Dataset Overview

Metric Training Dataset Full Dataset
Missing final.output.recovery values 1,521 1,963
Rows (no NaN values) 15,339 20,753
Data Removed 9.06% 8.64%

Features with ≥ 1% Missing Data - Side-by-Side Comparison

Rank Training Dataset % Missing Full Dataset % Missing Difference
1 secondary_cleaner.output.tail_sol ~11.6% " " ~9.4% Training +2.2%
2 rougher.output.recovery ~7.8% " " ~6.3% Training +1.5%
3 rougher.output.tail_ag ~6.3% " " ~5.1% Training +1.2%
4 rougher.output.tail_sol ~6.3% " " ~5.1% Training +1.2%
5 rougher.output.tail_au ~6.3% " " ~5.1% Training +1.2%
6 rougher.input.floatbank11_xanthate ~5.1% " " ~3.9% Training +1.2%
7 rougher.state.floatbank10_e_air ~3.5% " " ~2.6% Training +0.9%
8 primary_cleaner.output.concentrate_sol ~2.7% " " ~2.5% Training +0.2%
9 primary_cleaner.input.sulfate ~2.5% " " ~1.9% Training +0.6%
10 rougher.input.floatbank10_sulfate ~2.4% " " ~1.8% Training +0.6%
11 rougher.input.floatbank11_sulfate ~2.3% " " ~1.8% Training +0.5%
12 primary_cleaner.input.xanthate ~1.8% " " ~1.4% Training +0.4%
13 final.output.concentrate_sol ~1.7% " " ~1.3% Training +0.4%
14 primary_cleaner.input.depressant ~1.7% " " ~1.3% Training +0.4%
15 secondary_cleaner.state.floatbank2_a_air ~1.5% " " ~1.1% Training +0.4%
16 rougher.input.feed_rate ~1.4% " " ~1.1% Training +0.3%
17 primary_cleaner.output.concentrate_pb ~1.0% Not in ≥1% list <1.0% Training only

Conclusion

Training dataset consistently shows worse data quality across all features compared to Full dataset. Even after removing final.output.recovery NaN values, Training exhibits 0.2-2.2% higher missing data rates for every comparable feature. Output features are most problematic (5-12% missing rates), while input features remain manageable (<3% missing). The Training dataset loses 9.06% of rows during cleaning versus 8.64% for Full dataset, suggesting the Training subset has additional missing data issues that extend beyond the target variable alone.

Missing Data Analysis - Cleaned Datasets¶

Features with ≥ 1% Missing Data - Side-by-Side Comparison After removing output/calculation columns and final.output.recovery NaN values

Feature Training Count Training % Full Count Full % Difference
rougher.input.floatbank11_xanthate 779 5.08% 812 3.91% Training +1.17%
rougher.state.floatbank10_e_air 532 3.47% 532 2.56% Training +0.91%
primary_cleaner.input.sulfate 381 2.48% 388 1.87% Training +0.61%
rougher.input.floatbank10_sulfate 375 2.44% 380 1.83% Training +0.61%
rougher.input.floatbank11_sulfate 357 2.33% 368 1.77% Training +0.56%
primary_cleaner.input.xanthate 276 1.80% 282 1.36% Training +0.44%
primary_cleaner.input.depressant 257 1.68% 263 1.27% Training +0.41%
secondary_cleaner.state.floatbank2_a_air 230 1.50% 233 1.12% Training +0.38%
rougher.input.feed_rate 218 1.42% 221 1.06% Training +0.36%

Conclusion Summary

Bottom line: 9 features need missing data fixes, all manageable. Training dataset has consistently more gaps than Full dataset. The worst is rougher.input.floatbank11_xanthate at 5% missing - the rest are under 3%.

Simple solution: plan for imputation. Can't just delete rows with these missing rates, but standard imputation techniques should handle it easily. Problem identified, solution clear.

Missing Data Analysis - Updated Cleaned Datasets¶

Features with ≥ 1% Missing Data - Side-by-Side Comparison

After removing output/calculation columns, final.output.recovery NaN values, AND features with <1% missing data

Dataset Overview

  • Training Dataset: 14,855 rows (after all cleaning)
  • Full Dataset: 20,226 rows (after all cleaning)
Feature Training Count Training % Full Count Full % Difference
rougher.input.floatbank11_xanthate 654 4.40% 684 3.38% Training +1.02%
rougher.state.floatbank10_e_air 508 3.42% 508 2.51% Training +0.91%
rougher.input.floatbank11_sulfate 262 1.76% 272 1.34% Training +0.42%
primary_cleaner.input.sulfate 259 1.74% 263 1.30% Training +0.44%
rougher.input.floatbank10_sulfate 259 1.74% 261 1.29% Training +0.45%
secondary_cleaner.state.floatbank2_a_air 220 1.48% 223 1.10% Training +0.38%
primary_cleaner.input.xanthate 190 1.28% 193 0.95% Training +0.33%
primary_cleaner.input.depressant 189 1.27% 193 0.95% Training +0.32%
rougher.input.feed_rate 183 1.23% 185 0.91% Training +0.32%

Conclusion

Final scope: exactly 9 features need missing data attention. After aggressive cleaning, these are the only features with meaningful missing data gaps. Training dataset consistently shows 0.32-1.02% higher missing rates than Full dataset.

Rougher.input.floatbank11_xanthate remains the main challenge at 4.4% missing - everything else is under 2%. Standard imputation will handle this easily.

Xanthate Difference Analysis: rougher.input.floatbank##_xanthate¶

Distribution: The data appears roughly normally distributed with a slight right skew, centered around 6-7

Training | Full

image.png image.png

image.png image.png

After Imputation

image.png

Statistical Summary by Difference Groups

xanthate_difference = rougher.input.floatbank10_xanthate - rougher.input.floatbank11_xanthate

Group Category Variable Median Mean Range
< -1 (174 observations)
floatbank10_xanthate 6.243 4.438 0.001 - 7.625
floatbank11_xanthate 7.494 6.770 1.482 - 8.834
xanthate_difference -1.416 -2.333 -8.005 - -1.001
> 1 (93 observations)
floatbank10_xanthate 5.781 5.703 3.109 - 8.036
floatbank11_xanthate 1.931 2.413 0.000 - 6.318
xanthate_difference 3.204 3.290 1.026 - 7.576
0 to 1 (6,850 observations)
floatbank10_xanthate 6.000 5.908 0.004 - 9.703
floatbank11_xanthate 5.998 5.898 0.002 - 9.698
xanthate_difference 0.002 0.010 0.000 - 0.953
-1 to 0 (7,084 observations)
floatbank10_xanthate 5.995 5.858 0.001 - 9.655
floatbank11_xanthate 6.000 5.902 0.001 - 9.667
xanthate_difference -0.002 -0.044 -1.000 - 0.000
ALL (14,201 observations)
floatbank10_xanthate 5.998 5.864 0.001 - 9.703
floatbank11_xanthate 5.999 5.888 0.000 - 9.698
xanthate_difference -0.000 ✅ -0.024 -8.005 - 7.586
  • Training Dataset Difference Median: -0.000087
  • Full Dataset Difference Median: -0.000075
  • Training and Full Median Difference: 0.000012

Conclusion

Given that the median difference between rougher.input.floatbank10_xanthate and rougher.input.floatbank11_xanthate is approximately 0 (-0.000075), and most observations (98.1%) fall within small differences, imputing missing floatbank11 values using the corresponding floatbank10 values (i.e., floatbank11 = floatbank10) appears reasonable. However, this assumes missing values follow the same near-equilibrium pattern as the majority of the data. This should give the most accurate representation for our model.

The median difference for the Training Set (-0.000075) and Full Set (-0.000087) are virtually the same at 0 (0.000012), validating our imputation approach. Moreover, when observing the datsets after imputation, you see that the Full Dataset changed less than the Training Dataset; this further demonstrates the accuracy of our imputation strategy.

Floatbank Air Imputation Strategy¶

Analysis of the relationship between rougher.state.floatbank10_e_air and rougher.state.floatbank10_f_air reveals that filtering both variables to the normal operating range (844-856) successfully eliminates extreme outliers and isolates reliable data.

Before Imputation:

image.png

After Imputation:

image.png

Statistical Comparison:

Metric f_air filtered (844-856) e_air filtered (844-856) Improvement
Length 1,060 523 -50.7% (e_air outliers removed)
Median Difference 0.031 0.061 Stable
Mean Difference -25.269 0.032 +99.9% (bias eliminated)
Standard Deviation 129.603 0.677 -99.5% (variance reduced)
Min Difference -1,072.173 -4.910 Extreme outliers removed
Max Difference 302.433 2.772 Extreme outliers removed

Filtered Dataset Distribution (e_air filtered to 844-856)

Group Category Observations Percentage Key Finding
Difference > 0 284 54.3% e_air slightly lower than f_air
Difference ≤ 0 239 45.7% e_air slightly higher than f_air
Extreme Differences (>5 or <-5) 0 0% All anomalies eliminated

Key Findings

  • Strong linear relationship: Median difference of 0.061 confirms floatbank10_e_air ≈ floatbank10_f_air under normal conditions
  • All extreme anomalies eliminated: 523 observations remain within reasonable sensor variance (-4.91 to +2.77)
  • Data quality dramatically improved: 99.5% reduction in standard deviation, eliminating measurement bias

Recommended imputation approach: Drop the 2 missing values where floatbank10_f_air falls outside the 844-856 range, and impute remaining missing floatbank10_e_air values using floatbank10_e_air = floatbank10_f_air. This strategy leverages the strong correlation between variables while focusing on normal operating conditions, providing the most accurate representation for modeling purposes.

Sulfate Imputation Strategy Conclusion¶

Analysis of the relationship between rougher.input.floatbank10_sulfate and rougher.input.floatbank11_sulfate reveals distinct patterns based on the operating range of sulfate11 values, enabling a targeted approach to missing value imputation.

Histogram (All Data - Sulfate11): Training Set

image.pngimage.png

Histogram (Range 0-5.9 - Sulfate 11): Training Set

image.pngimage.png

Histogram (Range 0-11 - Sulfate 10): Training Set - for test set

image.png

Statistical Summary

Dataset Segment Observations Median Difference Mean Difference Key Characteristic
Overall Dataset 14,853 0.000085 0.360 Balanced but with outliers
Normal Range (sulfate11: 5.9-13.1) 9,953 ~0.002 ~0.001 Near-perfect equilibrium
Normal Range (sulfate11 ≤ 2 - 5.9) 280 ~0.00015 ~0.047 Equilibrium
Anomalous Range (sulfate11 0 - 2) 379 ~12.96 ~12.5 Sensor Disparity
Normal Range (sulfate10: 0-11.0) 6,076 ~-0.00014 -0.014 Near-perfect equilibrium

Identified Operating Range

Normal Operating Conditions (sulfate11: 5.9-13.1):

  • 66.7% of all observations fall within -1 to +1 difference range
  • Median differences approach zero across all subgroups
  • Both sensors track closely with minimal bias
  • Represents reliable, balanced sensor measurements

Normal Operating Conditions (sulfate11: 2.0-5.9):

  • A large portion of all observations fall within -1 to +1 difference range
  • Median differences approach zero
  • Both sensors track closely with minimal bias
  • Represents reliable, balanced sensor measurements

Outlier Conditions (sulfate11 0-2):

  • 379 observations show extreme positive differences (median: 12.96)
  • Indicates one sensor reading near-zero while other reads ~13
  • Likely represents sensor malfunction or extreme process conditions

Normal Operating Conditions (sulfate10: 0-11):

  • A large portion of all observations fall within -1 to +1 difference range
  • Median differences approach zero
  • Both sensors track closely with minimal bias
  • Represents reliable, balanced sensor measurements

Recommended Imputation Strategy

For missing floatbank10_sulfate values:

  • When floatbank11_sulfate is between 5.9-13.1: Use floatbank10_sulfate = floatbank11_sulfate
  • When floatbank11_sulfate is between 2.0 - 5.9: Use floatbank10_sulfate = floatbank11_sulfate
  • When floatbank11_sulfate is between 0 - 2.0: Use floatbank10_sulfate = floatbank11_sulfate + 12.96
    • When floatbank10_sulfate is between 0 - 11: Use floatbank11_sulfate = floatbank10_sulfate

Conclusion:

The relationship between floatbank 10 and 11 sulfate measurements follows two distinct conditional patterns: near-perfect correlation (difference ≈ 0.00015) in the mid and high range (2.0-5.9; 5.9-13.1), and a consistent +13 unit offset (difference = 12.96) in the low range (0.0-2.0). Conditional imputation based on sulfate 11 concentration provides a reliable, domain-informed approach that preserves the natural process relationships.

The relationship between floatbank 11 and 10 sulfate measurements is dominated by a single strong pattern: when sulfate 10 is between 0-11, the measurements are nearly identical (difference ≈ -0.00014). With 6,076 training observations representing 99.2% coverage (6,076/6,128) in this range, the near-perfect correlation provides extremely high confidence for imputation.

Feed Sol Imputation Strategy Conclusion¶

Analysis of the relationship between rougher.input.feed_sol and rougher.input.feed_size reveals that optimal imputation varies by feed_size range, requiring a range-specific approach using either median values or calculated offsets depending on data distribution characteristics.

image.pngimage.png

image.pngimage.png

image.pngimage.png

image.pngimage.png


Statistical Summary by Feed Size Range

Feed Size Range Observations Feed Sol Median Difference Median Imputation Method Rationale
24 - 30 (diff -5 to 12) 10 25.22 2.25 Median (25.22) for feed_size < 25.5 Small sample; lower sol values
24 - 30 (diff < -5) 33 39.27 -11.22 Median (39.27) for feed_size ≥ 25.5 Better sample; sol cluster 34-40
30 - 35 60 36.61 -4.45 Median (36.61) Moderate sample; sol range 30-43
35 - 38 97 30.25 6.91 Offset (+6.9) Difference tighter than sol range
40 - 50 3,866 34.54 11.81 Offset (+11.8) Large sample; consistent difference
50 - 60 4,927 36.80 17.62 Median (36.80) Largest sample; stable median
60 - 70 2,259 38.94 25.30 Median (38.94) Large sample; stable median
70 - 75 777 39.41 33.27 Median (39.41) Moderate sample
80 - 85 479 40.66 42.25 Median (40.66) Moderate sample

Key Findings

The feed_sol imputation strategy is more complex than the sulfate measurements, with decisions based on sample size, value clustering, and pattern consistency. Two primary ranges provide the most reliable imputation: feed_size 50-60 (4,927 observations) and 40-50 (3,866 observations). The 24-30 range splits into two distinct patterns based on difference values, with 33 observations supporting the higher sol median (39.27) and only 10 supporting the lower median (25.22). Median-based imputation was preferred for most ranges due to stable value clustering, while offset-based imputation was used for ranges 35-38 and 40-50 where the difference showed more consistent patterns. Unlike the sulfate strategies which achieved 99%+ coverage with near-perfect correlation, this approach has weaker correlations and higher variance, particularly in extreme ranges with limited observations, resulting in lower precision but adequate accuracy for modeling purposes.


Recommended imputation approach: Apply range-specific imputation to the test set based on feed_size value, using median sol values for ranges with stable clustering (24-35, 50-85) and offset calculations for ranges showing consistent difference patterns (35-38, 40-50). This pragmatic strategy leverages the strongest patterns in the largest sample ranges while accepting higher uncertainty in edge cases, providing adequate imputation quality for model training despite weaker underlying relationships compared to the sulfate features.

Remaining Columns Imputation Strategy (<1% Missing Data)¶

For all remaining columns in the test set with less than 1% missing data, a simple median imputation strategy was applied using column-specific medians calculated from the training set.


Methodology

  • Calculate the median value for each column from the training set (after row-dropping preprocessing)
  • Apply these training set medians to fill missing values in the corresponding test set columns
  • No conditional logic or range-based strategies required

Rationale: With <1% missing data per column, simple median imputation is computationally efficient and has negligible impact on model performance. Using training set medians (rather than test set medians) prevents data leakage and maintains proper train-test separation.


Key Considerations

For columns with <1% missing data, the additional complexity of conditional imputation is unnecessary. The simple median approach provides a clean, efficient solution that maintains data integrity while having minimal impact on final model predictions.

Original Dataset: Outlier and Distribution Summary¶

Analysis of the pre-imputation dataset reveals substantial variance and outliers across all metal concentrations, especially in gold (Au). Zero values appear across all stages, suggesting possible measurement errors or process shutdowns.

image.png


Distribution Summary by Metal Type

Stage Median Mean Std Dev Min Max Key Observation
GOLD: Rougher Feed 7.88 7.57 3.03 0.00 14.09 Baseline input
GOLD: Rougher Concentrate 20.00 17.88 6.79 0.00 28.82 2.5x concentration from feed
GOLD: Rougher Tail 1.81 1.82 0.70 0.02 9.69 Low concentration, good separation
GOLD: Primary Cleaner Concentrate 32.36 29.21 10.54 0.00 45.93 Highest variance (±10.54)
GOLD: Primary Cleaner Tail 3.51 3.67 1.99 0.00 18.53 Moderate loss in tail
GOLD: Secondary Cleaner Tail 3.96 4.04 2.61 0.00 26.81 Higher variance than primary tail
GOLD: Final Concentrate 44.65 40.00 13.40 0.00 53.61 Maximum enrichment achieved (5.7x feed)
GOLD: Final Tail 2.91 2.83 1.26 0.00 9.79 Minimal loss, efficient recovery
SILVER: Rougher Feed 8.30 8.07 3.13 0.00 14.87 Baseline input
SILVER: Rougher Concentrate 11.79 10.87 4.38 0.00 24.48 1.4x concentration from feed
SILVER: Rougher Tail 5.76 5.59 1.11 0.59 12.72 Higher than Au tail (less efficient separation)
SILVER: Primary Cleaner Concentrate 8.27 7.69 3.11 0.00 16.08 Lower than rougher concentrate (unusual)
SILVER: Primary Cleaner Tail 15.60 14.88 6.54 0.00 29.46 Higher than concentrate (inverted pattern)
SILVER: Secondary Cleaner Tail 15.22 13.38 5.77 0.00 23.26 Similar to primary cleaner tail
SILVER: Final Concentrate 4.95 4.78 2.03 0.00 16.00 Low enrichment (~0.6x feed)
SILVER: Final Tail 9.48 8.92 3.52 0.00 19.55 Higher than concentrate (poor recovery)
LEAD: Rougher Feed 3.43 3.31 1.45 0.00 7.14 Baseline input (lowest of 3 metals)
LEAD: Rougher Concentrate 7.57 6.90 2.81 0.00 18.39 2.2x concentration from feed
LEAD: Rougher Tail 0.59 0.59 0.32 0.00 3.78 Excellent separation (82% reduction)
LEAD: Primary Cleaner Concentrate 9.92 8.92 3.71 0.00 17.08 Further enrichment to 3x feed
LEAD: Primary Cleaner Tail 3.15 3.18 1.65 0.00 9.63 Moderate loss in tail
LEAD: Secondary Cleaner Tail 5.07 5.30 3.09 0.00 17.04 Higher than primary tail
LEAD: Final Concentrate 9.91 9.10 3.23 0.00 17.03 Maximum enrichment (2.9x feed)
LEAD: Final Tail 2.65 2.49 1.19 0.00 6.09 Low loss, good overall recovery

image.png

Gold (Au)

  • Correct overall concentration pattern (feed → concentrate → tail).
  • Extreme outliers in primary/secondary cleaner and final concentrate stages (up to 53.61).
  • High variance (std dev 10–13) indicates fluctuating ore grades or inconsistent separation efficiency.
  • Zero values in concentrate stages may signal failed or unrecorded process runs.

Silver (Ag)

  • Inverted concentration pattern: tails often exceed concentrates (e.g., final tail > final concentrate).
  • Indicates potential data labeling issues or silver rejection during gold purification.
  • Outliers up to 29.46 and high tail variance (std dev ~6) complicate modeling.

Lead (Pb)

  • Consistent concentration trend (concentrate > tail) across all stages.
  • Moderate outliers in rougher concentrate and secondary cleaner tail (max ~18).
  • Lowest relative variance among the three metals.

Cross-Metal Insights

  • All metals contain zeros — investigate whether these are true values or missing data.
  • Gold shows the widest range (0–54), silver 0–29, and lead 0–17 → scaling or log transformation recommended.
  • Boxplots confirm: Au has the most extreme outliers, Ag shows structural anomalies, and Pb is most stable.

Recommendations

  • Investigate zero values for potential data or process errors.
  • Validate silver data — inverted trends likely indicate mislabeling or intentional rejection.
  • Apply robust scaling or log transforms to handle wide value ranges and outliers.
  • Flag abnormal tails (e.g., Au secondary cleaner tail) as potential inefficiencies.

Original Dataset Vs. Imputed Dataset: Outlier Analysis and Distribution Summary¶

Metal Original Median Imputed Median Original Mean Imputed Mean Original Std Imputed Std Original Min Imputed Min Original Max Imputed Max Notes
Au (Gold) 7.88 7.77 7.57 8.01 3.03 1.94 0.00 0.01 14.09 13.90 Less spread and no zeros after imputation — good
Ag (Silver) 8.30 8.28 8.07 8.71 3.13 1.98 0.00 0.01 14.87 14.6 Same pattern — realistic floor now
Pb (Lead) 3.43 3.47 3.31 3.57 1.45 1.11 0.00 0.01 7.14 7.14 Minor tightening — variance reduced

image.pngimage.png

Conclusion

The rougher feed concentrations of Au, Ag, and Pb were compared between the original and imputed datasets. The original data contained several zeros, likely due to missing sensor readings. After imputation, the minimum values increased to 0.01, standard deviations decreased slightly, and overall means remained consistent — indicating improved data integrity without distorting the underlying distributions. Outliers were retained, as they likely reflect genuine variations in ore composition rather than measurement errors.

Feed_Size Distribution: Train Vs. Test¶

Dataset Train Mean Test Mean Train Median Test Median Std Diff Notes
Original 58.68 55.94 54.10 50.00 +1.2 More “natural,” small realistic difference
Imputed 59.92 55.93 55.25 50.11 +0.3 Slightly higher mean in train, but similar pattern

image.pngimage.png

Conclusion

Both the original and imputed datasets were evaluated for feed size distribution consistency between the training and test sets. The original dataset shows slightly more natural variation, while imputation slightly increases the mean feed size in the training set due to smoothing of missing values. However, both datasets maintain comparable distribution shapes and ranges, confirming that train–test distributions are sufficiently aligned for model evaluation.

Total Concentration Comparison: Original vs Imputed¶

Stage Dataset Count Mean Std Dev Min 25% 50% (Median) 75% Max
Feed Original 22,471 18.99 7.30 0.00 16.55 19.63 23.62 35.07
Feed Imputed 14,336 20.29 4.59 0.03 17.00 19.44 23.09 34.83
Rougher Concentrate Original 22,618 35.65 13.22 0.00 37.38 39.98 42.19 55.57
Final Concentrate Original 22,627 53.88 17.70 0.00 58.71 60.08 60.99 65.58

image.pngimage.png

Key Observations:

  1. Zeros removed after imputation: The feed stage now has a minimum of 0.03 vs 0 in the original dataset.
  2. Tighter distribution: Standard deviation decreased in the imputed feed, indicating less extreme spread.
  3. Mean values largely consistent: No major distortion occurred due to imputation.
  4. Rougher and final stages in the original dataset contain zeros; these were not imputed because they won’t be used for modeling, but they show anomalous sensor readings or measurement errors.
  5. Imputed feed data is more reliable for modeling and avoids artificial spikes or zeros.

sMAPE Analysis¶

Model Final sMAPE (%) Interpretation
Linear Regression 14.64 Achieved strong predictive accuracy, indicating relatively low average percentage error between predicted and actual recovery values.
Decision Tree 22.48 Performed noticeably worse than other models, suggesting higher variance or overfitting to the training data.

| Random Forest | 14.30 | Produced the lowest sMAPE, very close to Linear Regression, indicating robust and reliable performance.


Conclusion

Model evaluation using the final sMAPE metric showed that both Linear Regression and Random Forest achieved strong and nearly identical predictive performance, with error rates around 14.3%. In comparison, the Decision Tree model produced a higher sMAPE of 22.5%, indicating less accurate predictions and potential overfitting.

Since the difference between Linear Regression and Random Forest is negligible, both models will be carried forward for further testing and comparison. This approach ensures that the final selection balances predictive accuracy, computational efficiency, and model interpretability.

Cross-Validation Analysis¶

Cross-Validation Results (5-Fold MAE)
Model Average Cross-Validation MAE
Random Forest 6.26
Linear Regression 6.86
Decision Tree N/A (not tested / worse)

Conclusion

The predictive performance of different models was evaluated using both sMAPE and 5-fold cross-validation with mean absolute error.

Random Forest: Produced the lowest sMAPE (14.30%) and the lowest average cross-validation MAE (6.26), indicating strong predictive accuracy and robustness.

Linear Regression: Achieved similar performance, with slightly higher sMAPE (14.64%) and MAE (6.86), suggesting it is also a reliable model, though marginally less accurate than Random Forest.

Decision Tree: Performed noticeably worse, with higher sMAPE (22.48%), indicating higher variance and overfitting risk.

Overall, Random Forest is the top-performing model, though Linear Regression remains a strong alternative. Decision Tree is not recommended due to its comparatively poorer accuracy.

Final Model Evaluation: Random Forest on True Test Set¶

The Random Forest model was evaluated on the complete test set with true target values and optimized through hyperparameter tuning.


Model Performance Progression

Stage Model Configuration Final sMAPE (%) Rougher MAE Final MAE Key Improvement
Initial Baseline Default Random Forest 14.30% 6.26 6.86 Cross-validation results
True Test Set Default Random Forest 12.28% - - -2.02pp improvement on real data
Hypertuned Model max_depth=5, n_estimators=100 11.48% 6.78 5.51 -0.80pp final improvement

Hyperparameter Tuning Results

Optimal parameters identified through GridSearchCV (5-fold):
Target Variable Best max_depth Best n_estimators Cross-Validation MAE
Final Output Recovery 5 100 5.51
Rougher Output Recovery 5 100 6.78

Sample Predictions on Test Set

First 10 predictions demonstrate model output range:
Prediction # Final Output Recovery (%) Rougher Output Recovery (%)
1 67.82 89.33
2 68.36 85.24
3 68.15 86.05
4 67.68 84.58
5 69.04 88.14
6 68.95 86.42
7 66.83 74.78
8 65.07 73.43
9 65.69 69.31
10 65.81 75.36

Conclusion

The hypertuned Random Forest model achieves a final sMAPE of 11.48% on the true test set, representing a 2.82 percentage point improvement from the initial baseline (19.7% relative error reduction). The optimal configuration (max_depth=5, n_estimators=100) balances model complexity with ensemble strength, providing reliable predictions for gold recovery optimization with approximately 11.5% average error.

Supplemental Information¶

MAE Calculation & Rougher Recovery (Training Set)¶

Formula Accuracy Check

Metric Value Interpretation
Mean Absolute Error (MAE) 9.3e-15 ~0% - Formula perfectly matches target values
Values that differ 0 All differences are floating-point rounding errors
Values differing by > 1e-20 0 Confirms perfect formula-target alignment

Dataset Overview

Category Count Percentage Notes
Total training rows 16,860 100% Complete dataset size
Valid formula results 14,287 84.7% Rows where formula could be calculated
NaN in formula results 2,283 13.5% Due to missing required input columns
NaN in target values 2,573 15.3% Missing measurements in dataset
Rows after dropping NaNs 14,287 84.7% Final comparison dataset

Key Findings:

Finding Status Impact
Formula Accuracy ✅ Perfect Known formula perfectly reproduces target values
Data Coverage ⚠️ Good 84.7% of data usable for comparison
Missing Data Pattern ℹ️ Expected NaN values are typical in industrial datasets
Formula Reliability ✅ Excellent Zero meaningful calculation errors detected

Summary: The known formula demonstrates perfect accuracy when applied to the feature columns, with calculated values matching the target column within floating-point precision. This validates both the formula correctness and data quality for 84.7% of the dataset.

Features NOT in the Test Set (34 Total):¶

Features Parameter Notes
final.output.concentrate_ag Output Final concentrate silver
final.output.concentrate_au Output Final concentrate gold
final.output.concentrate_pb Output Final concentrate lead
final.output.concentrate_sol Output Final concentrate solid
final.output.recovery Target Final recovery target
final.output.tail_ag Output Final tailings silver
final.output.tail_au Output Final tailings gold
final.output.tail_pb Output Final tailings lead
final.output.tail_sol Output Final tailings solid
primary_cleaner.output.concentrate_ag Output Primary cleaner concentrate silver
primary_cleaner.output.concentrate_au Output Primary cleaner concentrate gold
primary_cleaner.output.concentrate_pb Output Primary cleaner concentrate lead
primary_cleaner.output.concentrate_sol Output Primary cleaner concentrate solid
primary_cleaner.output.tail_ag Output Primary cleaner tailings silver
primary_cleaner.output.tail_au Output Primary cleaner tailings gold
primary_cleaner.output.tail_pb Output Primary cleaner tailings lead
primary_cleaner.output.tail_sol Output Primary cleaner tailings solid
rougher.calculation.au_pb_ratio Calculations Gold to lead ratio (data leakage)
rougher.calculation.floatbank10_sulfate_to_au_feed Calculations Floatbank10 sulfate to gold feed ratio (data leakage)
rougher.calculation.floatbank11_sulfate_to_au_feed Calculations Floatbank11 sulfate to gold feed ratio (data leakage)
rougher.calculation.sulfate_to_au_concentrate Calculations Sulfate to gold ratio (data leakage)
rougher.output.concentrate_ag Output Rougher concentrate silver
rougher.output.concentrate_au Output Rougher concentrate gold
rougher.output.concentrate_pb Output Rougher concentrate lead
rougher.output.concentrate_sol Output Rougher concentrate solid
rougher.output.recovery Target Rougher recovery target
rougher.output.tail_ag Output Rougher tailings silver
rougher.output.tail_au Output Rougher tailings gold
rougher.output.tail_pb Output Rougher tailings lead
rougher.output.tail_sol Output Rougher tailings solid
secondary_cleaner.output.tail_ag Output Secondary cleaner tailings silver
secondary_cleaner.output.tail_au Output Secondary cleaner tailings gold
secondary_cleaner.output.tail_pb Output Secondary cleaner tailings lead
secondary_cleaner.output.tail_sol Output Secondary cleaner tailings solid

Summary by Parameter Type:

Parameter Type Count Reason for Exclusion
Output 28 Features only known after processing, not available at prediction time
Target 2 Used for stage-specific predictions
Calculations 4 Dependent on outputs/targets → potential data leakage

Note: All features are of Float data type.

Missing Values Overview (≥ 1%)¶

1% - < 5% Missing Data

Shared (Full / Train / Test)
Feature Full % Train % Test %
rougher.input.feed_sol 1.580 1.732 1.148
rougher.input.floatbank10_xanthate 2.065 2.052 2.108
Full & Train Only
Feature Full % Train %
final.output.concentrate_sol 1.695 2.195
primary_cleaner.output.concentrate_pb 1.972 2.123
primary_cleaner.output.concentrate_sol 3.513 3.772
primary_cleaner.output.tail_sol 1.545 1.667
rougher.input.feed_pb 1.074 1.352
rougher.input.feed_rate 2.434 3.043
rougher.input.feed_size 1.933 2.473
rougher.input.floatbank11_sulfate 2.985 3.695
rougher.state.floatbank10_e_air 2.729 3.577
secondary_cleaner.state.floatbank2_a_air 1.686 2.153
Dataset-Specific Features
Category Feature Percentage
Full Only final.output.tail_sol 1.193
Train Only final.output.tail_pb 1.085
Test Only primary_cleaner.input.depressant 4.866
Test Only primary_cleaner.input.xanthate 2.844
Test Only rougher.input.floatbank10_sulfate 4.404

5% - < 10% Missing Data

Shared (Full / Train / Test)
Feature Full % Train % Test %
primary_cleaner.input.sulfate 7.083 7.752 5.175
Full & Train Only
Feature Full % Train %
final.output.recovery 8.641 9.021
primary_cleaner.input.depressant 6.806 7.485
primary_cleaner.input.xanthate 5.067 5.842
rougher.calculation.au_pb_ratio 7.162 7.367
rougher.input.floatbank10_sulfate 5.727 6.192
Full & Test Only
Feature Full % Test %
rougher.input.floatbank11_xanthate 9.936 6.049
Dataset-Specific Features
Category Feature Percentage
Full Only secondary_cleaner.output.tail_sol 9.751

10% - < 15% Missing Data

Full & Train Only
Feature Full % Train %
rougher.output.tail_ag 12.049 13.345
rougher.output.tail_sol 12.044 13.339
rougher.output.tail_au 12.044 13.339
Dataset-Specific Features
Category Feature Percentage
Full Only rougher.output.recovery 13.730
Train Only rougher.input.floatbank11_xanthate 11.293
Train Only secondary_cleaner.output.tail_sol 11.779

≥ 15% Missing Data

Dataset-Specific Features
Category Feature Percentage
Train Only rougher.output.recovery 15.261

Summary Statistics

Dataset Total Features Features with Missing Data Complete Features
Full Dataset 87 85 2
Training Set 87 85 2
Test Set 53 51 2

Dataset Comparison Summary

Missing bins All Full Train Test Full + Train Full + Test
< 1% 40 1 1 6 21 –
1 – < 5% 2 1 1 3 10 –
5 – < 10% 1 1 – – 5 1
10 – < 15% – 1 2 – 3 –
≥ 15% – – – 1 – –
Total 43 4 5 9 39 1
Missing Data Level Full Dataset Training Dataset Test Dataset
< 1% 62 (71.26%) features 62 (71.26%) features 45 (84.91%) features
1% - < 5% 14 (16.10%) features 14 (16.10%) features (9.43%) 5 features
5% - < 10% 7 (8.05%) features 5 (5.75%) features 2 (3.77%) features
10% - < 15% 4 (4.60%) features 5 (5.75%) features 0 (0%) features
≥ 15% 0 (0%) features 1 (1.15%) feature 0 (0%) features
Total with Missing Data 87 features 87 features 53 features

General Threshold Meaning

  • < 1% : Negligible (Imputaion almost never necessary)
  • 1 - < 5% : Minor (Imputation sometimes necessary)
  • 5 - < 10% : Intermediate (Imputation usually necessary)
  • 10 - < 15% : High (Imputation often necessary)
  • ≥ 15% : Extremely High (Imputation almost always necessary)

Key Insights:

  • Test dataset appears cleaner because it excludes 34 output/target/calculation features, which are the primary source of high missingness in the Full and Training sets.
  • Shared features show similar data quality across datasets, with the Test set sometimes performing slightly better on input features.
  • Training dataset reveals the full scope of missing data, including the most problematic target feature (rougher.output.recovery, 15.26% missing).
  • Output and calculation features consistently drive higher missing rates, while predictor (input) features remain relatively complete.
  • Excluded features explain most severe missing data issues, confirming that the Test set is intentionally designed for clean model evaluation.

Distribution of Missing Values

  • Negligible (< 1%): Full & Train ~71% of features; Test ~85% → most data is very clean.
  • Moderate (1–<10%): Full: ~24% (21); Train: ~22% (19); Test: ~13% (7) → mostly input features; manageable with simple imputation.
  • Severe (≥ 10%): Full: ~5% (4 features), Train: ~7% (6 features) → all outputs/calculations. Test: (N/A) → explains clean profile.

Modeling Impact

  • Predictor features: Low missingness (< 5%) → imputation straightforward and unlikely to distort results.
  • Target/output features: Higher missingness but not used for prediction → no direct risk to training or model reliability.

Missing Data Summary - Side-by-Side Comparison (Target NaN's removed)¶

Dataset Overview

Metric Training Dataset Full Dataset
Missing final.output.recovery values 1,521 1,963
Rows (no NaN values) 15,339 20,753
Data Removed 9.06% 8.64%

Features with ≥ 1% Missing Data - Side-by-Side Comparison

Rank Training Dataset % Missing Full Dataset % Missing Difference
1 secondary_cleaner.output.tail_sol ~11.6% " " ~9.4% Training +2.2%
2 rougher.output.recovery ~7.8% " " ~6.3% Training +1.5%
3 rougher.output.tail_ag ~6.3% " " ~5.1% Training +1.2%
4 rougher.output.tail_sol ~6.3% " " ~5.1% Training +1.2%
5 rougher.output.tail_au ~6.3% " " ~5.1% Training +1.2%
6 rougher.input.floatbank11_xanthate ~5.1% " " ~3.9% Training +1.2%
7 rougher.state.floatbank10_e_air ~3.5% " " ~2.6% Training +0.9%
8 primary_cleaner.output.concentrate_sol ~2.7% " " ~2.5% Training +0.2%
9 primary_cleaner.input.sulfate ~2.5% " " ~1.9% Training +0.6%
10 rougher.input.floatbank10_sulfate ~2.4% " " ~1.8% Training +0.6%
11 rougher.input.floatbank11_sulfate ~2.3% " " ~1.8% Training +0.5%
12 primary_cleaner.input.xanthate ~1.8% " " ~1.4% Training +0.4%
13 final.output.concentrate_sol ~1.7% " " ~1.3% Training +0.4%
14 primary_cleaner.input.depressant ~1.7% " " ~1.3% Training +0.4%
15 secondary_cleaner.state.floatbank2_a_air ~1.5% " " ~1.1% Training +0.4%
16 rougher.input.feed_rate ~1.4% " " ~1.1% Training +0.3%
17 primary_cleaner.output.concentrate_pb ~1.0% Not in ≥1% list <1.0% Training only

Key Patterns

Missing Data Severity

  • Training dataset consistently shows higher missing data rates across all comparable features
  • Differences range from +0.2% to +2.2% with Training having more missing data
  • Training dataset has one additional feature (primary_cleaner.output.concentrate_pb) with ≥1% missing data

Feature Categories

  • Output/Target features show the highest missing rates in both datasets:
    • secondary_cleaner.output.tail_sol (highest in both)
    • rougher.output.* features (consistently problematic)
  • Input features generally have lower missing rates:
    • rougher.input.* and primary_cleaner.input.* features typically <3%
  • State features have minimal missing data (mostly <1%)

Data Quality Impact

  • Training dataset loses slightly more data (9.06% vs 8.64%) when cleaning
  • Both datasets maintain ~85-91% usable data after removing NaN values
  • Output features drive most data loss - these are excluded in prediction tasks anyway

Missing Data Analysis - Cleaned Datasets¶

Features with ≥ 1% Missing Data - Side-by-Side Comparison After removing output/calculation columns and final.output.recovery NaN values

Feature Training Count Training % Full Count Full % Difference
rougher.input.floatbank11_xanthate 779 5.08% 812 3.91% Training +1.17%
rougher.state.floatbank10_e_air 532 3.47% 532 2.56% Training +0.91%
primary_cleaner.input.sulfate 381 2.48% 388 1.87% Training +0.61%
rougher.input.floatbank10_sulfate 375 2.44% 380 1.83% Training +0.61%
rougher.input.floatbank11_sulfate 357 2.33% 368 1.77% Training +0.56%
primary_cleaner.input.xanthate 276 1.80% 282 1.36% Training +0.44%
primary_cleaner.input.depressant 257 1.68% 263 1.27% Training +0.41%
secondary_cleaner.state.floatbank2_a_air 230 1.50% 233 1.12% Training +0.38%
rougher.input.feed_rate 218 1.42% 221 1.06% Training +0.36%

Conclusion Summary

After cleaning the data, only 9 features have missing data issues that need attention. These are all input features required for making predictions, with missing rates between 1-5%. The Training dataset consistently has more missing data than the Full dataset across all features.

The biggest problem is rougher.input.floatbank11_xanthate (5% missing in Training vs 4% in Full). The remaining 8 features have smaller but consistent gaps that will need proper handling during model preparation.

This is manageable but requires planning. Simple deletion isn't appropriate with these missing rates, so imputation strategies will be needed. We've now identified exactly what needs fixing for successful model training.

Missing Data Analysis - Updated Cleaned Datasets¶

Features with ≥ 1% Missing Data - Side-by-Side Comparison

After removing output/calculation columns, final.output.recovery NaN values, AND features with <1% missing data

Dataset Overview

  • Training Dataset: 14,855 rows (after all cleaning)
  • Full Dataset: 20,226 rows (after all cleaning)
Feature Training Count Training % Full Count Full % Difference
rougher.input.floatbank11_xanthate 654 4.40% 684 3.38% Training +1.02%
rougher.state.floatbank10_e_air 508 3.42% 508 2.51% Training +0.91%
rougher.input.floatbank11_sulfate 262 1.76% 272 1.34% Training +0.42%
primary_cleaner.input.sulfate 259 1.74% 263 1.30% Training +0.44%
rougher.input.floatbank10_sulfate 259 1.74% 261 1.29% Training +0.45%
secondary_cleaner.state.floatbank2_a_air 220 1.48% 223 1.10% Training +0.38%
primary_cleaner.input.xanthate 190 1.28% 193 0.95% Training +0.33%
primary_cleaner.input.depressant 189 1.27% 193 0.95% Training +0.32%
rougher.input.feed_rate 183 1.23% 185 0.91% Training +0.32%

Conclusion

Final scope: exactly 9 features need missing data attention. After aggressive cleaning, these are the only features with meaningful missing data gaps. Training dataset consistently shows 0.32-1.02% higher missing rates than Full dataset.

Rougher.input.floatbank11_xanthate remains the main challenge at 4.4% missing - everything else is under 2%. Standard imputation will handle this easily.

Xanthate Difference Analysis: rougher.input.floatbank##_xanthate¶

Distribution: The data appears roughly normally distributed with a slight right skew, centered around 6-7

image.png

After Imputation

image.png

Statistical Summary by Difference Groups

xanthate_difference = rougher.input.floatbank10_xanthate - rougher.input.floatbank11_xanthate

Group Category Variable Median Mean Range
< -1 (174 observations)
floatbank10_xanthate 6.243 4.438 0.001 - 7.625
floatbank11_xanthate 7.494 6.770 1.482 - 8.834
xanthate_difference -1.416 -2.333 -8.005 - -1.001
> 1 (93 observations)
floatbank10_xanthate 5.781 5.703 3.109 - 8.036
floatbank11_xanthate 1.931 2.413 0.000 - 6.318
xanthate_difference 3.204 3.290 1.026 - 7.576
0 to 1 (6,850 observations)
floatbank10_xanthate 6.000 5.908 0.004 - 9.703
floatbank11_xanthate 5.998 5.898 0.002 - 9.698
xanthate_difference 0.002 0.010 0.000 - 0.953
-1 to 0 (7,084 observations)
floatbank10_xanthate 5.995 5.858 0.001 - 9.655
floatbank11_xanthate 6.000 5.902 0.001 - 9.667
xanthate_difference -0.002 -0.044 -1.000 - 0.000
ALL (14,201 observations)
floatbank10_xanthate 5.998 5.864 0.001 - 9.703
floatbank11_xanthate 5.999 5.888 0.000 - 9.698
xanthate_difference -0.000 -0.024 -8.005 - 7.586

Key Observations

  • Most observations fall within small differences: 13,934 out of 14,201 total observations (98.1%) have differences between -1 and +1
  • Extreme negative differences are more common: 174 observations with differences < -1 vs 93 observations with differences > +1
  • Near-equilibrium groups dominate: The "0 to 1" and "-1 to 0" groups contain the vast majority of data points
  • Overall dataset shows slight negative bias: The complete dataset has a mean difference of -0.024, indicating floatbank11_xanthate is slightly higher on average than floatbank10_xanthate
  • Median values are nearly identical across floatbanks: Overall medians of 5.998 vs 5.999 show the datasets are well-balanced at the center
  • Largest extreme difference: -8.005 in the "< -1" group, with maximum positive difference of 7.586
    • Training Dataset Difference Median: -0.000087
  • Full Dataset Difference Median: -0.000075
  • Training and Full Median Difference: 0.000012

Conclusion

Given that the median difference between rougher.input.floatbank10_xanthate and rougher.input.floatbank11_xanthate is approximately 0 (-0.000075), and most observations (98.1%) fall within small differences, imputing missing floatbank11 values using the corresponding floatbank10 values (i.e., floatbank11 = floatbank10) appears reasonable. However, this assumes missing values follow the same near-equilibrium pattern as the majority of the data. This should give the most accurate representation for our model.

The median difference for the Training Set (-0.000075) and Full Set (-0.000087) are virtually the same at 0 (0.000012), validating our imputation approach. Moreover, when observing the datsets after imputation, you see that the Full Dataset changed less than the Training Dataset; this further demonstrates the accuracy of our imputation strategy.

Floatbank Air Difference Analysis (fb10_e_air & fb10_f_air) - Detailed Breakdown**¶

Filtered for floatbank10_f_air between 844-856 to focus on normal operating range for imputation

Before Imputation

image.png

After Imputation

image.png

Overall Difference Metrics: f between 844-856

Metric Value
Length 1,060
Median 0.031
Mean -25.269
Std 129.603
Min -1,072.173
Max 302.433

Categorical Breakdown by Difference Groups: f between 844-856

Group Category Variable Median Mean Min Max
Difference F > 5 (2 observations)
floatbank10_e_air 560.396 560.396 547.540 573.253
floatbank10_f_air 850.177 850.177 849.974 850.381
difference_f 289.781 289.781 277.128 302.433
Difference F Between 1-5 (27 observations)
floatbank10_e_air 849.528 849.566 846.438 852.565
floatbank10_f_air 850.761 850.977 848.870 854.419
difference_f 1.315 1.412 1.014 2.772
Difference F Between 0-1 (257 observations)
floatbank10_e_air 849.780 849.760 845.426 854.668
floatbank10_f_air 850.158 850.135 845.446 855.604
difference_f 0.332 0.374 0.003 1.000
Difference F Between -1 to 0 (215 observations)
floatbank10_e_air 850.287 850.241 845.259 853.225
floatbank10_f_air 849.857 849.874 844.795 852.897
difference_f -0.347 -0.366 -1.000 -0.004
Difference F Between -5 to -1 (24 observations)
floatbank10_e_air 851.440 851.520 848.338 855.630
floatbank10_f_air 850.059 849.903 845.839 852.510
difference_f -1.301 -1.617 -4.910 -1.008
Difference F Between -50 to -5 (3 observations)
floatbank10_e_air 868.354 870.150 865.494 876.603
floatbank10_f_air 850.376 851.789 850.298 854.693
difference_f -18.056 -18.361 -26.226 -10.801
Difference F Between -100 to -50 (2 observations)
floatbank10_e_air 907.761 907.761 904.596 910.927
floatbank10_f_air 852.104 852.104 849.955 854.253
difference_f -55.657 -55.657 -60.972 -50.342
Difference F Between -150 to -100 (0 observations)
floatbank10_e_air N/A N/A N/A N/A
floatbank10_f_air N/A N/A N/A N/A
difference_f N/A N/A N/A N/A
Difference F Between -220 to -150 (1 observation)
floatbank10_e_air 1,004.413 1,004.413 1,004.413 1,004.413
floatbank10_f_air 850.442 850.442 850.442 850.442
difference_f -153.971 -153.971 -153.971 -153.971
Difference F < -200 (23 observations)
floatbank10_e_air 1,502.374 1,470.636 1,097.806 1,922.637
floatbank10_f_air 849.991 849.996 849.365 850.658
difference_f -652.383 -620.640 -1,072.173 -247.191

Overall Difference Metrics: e between 844-856

Metric Value
Length 523
Median 0.061
Mean 0.032
Std 0.677
Min -4.910
Max 2.772

Categorical Breakdown by Difference Groups: e between 844-856

Group Category Variable Median Mean Min Max
Difference F > 0 (284 observations)
floatbank10_e_air 849.741 849.742 845.426 854.668
floatbank10_f_air 850.223 850.215 845.446 855.604
difference_f 0.374 0.473 0.003 2.772
Difference F Between -50 to 0 (239 observations)
floatbank10_e_air 850.352 850.369 845.259 855.630
floatbank10_f_air 849.879 849.877 844.795 852.897
difference_f -0.368 -0.492 -4.910 -0.004
Difference F < -50 (0 observations)
floatbank10_e_air N/A N/A N/A N/A
floatbank10_f_air N/A N/A N/A N/A
difference_f N/A N/A N/A N/A

Key Observations for Imputation Strategy

Filtered Dataset Results (Both e_air and f_air between 844-856):

  • Dataset size reduced: From 1,060 to 523 observations after filtering both variables to normal operating range
  • All extreme outlier categories eliminated: No observations in categories beyond -5 to +5 difference range
  • Improved statistics: Standard deviation dropped from 129.603 to 0.677, mean shifted from -25.269 to 0.032
  • Only normal operating differences remain: 523 observations distributed across -5 to +2.8 range

Distribution in filtered dataset:

  • Difference F > 0: 284 observations (54.3%)
  • Difference F Between -5 to 0: 239 observations (45.7%)
  • All extreme categories (< -5 or > 5): 0 observations

Imputation Strategy Validation:

  • Median difference: 0.061 (very close to 0)
  • Mean difference: 0.032 (very close to 0)
  • Range: -4.91 to +2.77 (all within reasonable sensor variance)

Conclusion:

Filtering both variables to the 844-856 range successfully isolates normal operating conditions. The relationship floatbank10_e_air ≈ floatbank10_f_air (difference ≈ 0) is strongly validated for imputation in this range. Using floatbank10_e_air = floatbank10_f_air + 0.06 or simply floatbank10_e_air = floatbank10_f_air is well-justified for missing values within the normal operating range.

Therefore, reasonable to drop the 2 NaN values where f is not in the 844 - 856 range and fill the rest of the floatbank_e_air NaN values to the same values as floatbank_f_air.

Sulfate Difference Analysis: rougher.input.floatbank10_sulfate vs floatbank11_sulfate¶

sulfate_difference = rougher.input.floatbank10_sulfate - rougher.input.floatbank11_sulfate

Before Imputation - Training

image.png

After Imputation - Training

image.png

Overall Dataset Statistics

Metric floatbank10_sulfate floatbank11_sulfate sulfate_difference
Length 14,853 14,853 14,853
Median 11.708 11.414 0.000085
Mean 11.763 11.389 0.360
Min 0.000044 0.000049 -12.978
Max 36.118 37.981 23.747

Categorical Breakdown by Sulfate11 Range and Difference Groups

Group Category Variable Median Mean Min Max
Sulfate11: 5.9-13.1 & Diff > 1 (14 observations)
floatbank10_sulfate 10.686 11.223 8.495 15.744
floatbank11_sulfate 8.314 8.476 6.321 11.343
sulfate_difference 2.303 2.748 1.003 7.280
Sulfate11: 5.9-13.1 & Diff 0-1 (4,945 observations)
floatbank10_sulfate 10.697 10.443 5.905 13.236
floatbank11_sulfate 10.687 10.435 5.905 13.098
sulfate_difference 0.002 0.008 0.000002 0.998
Sulfate11: 5.9-13.1 & Diff 0-(-1) (4,966 observations)
floatbank10_sulfate 10.629 10.436 5.914 13.095
floatbank11_sulfate 10.645 10.444 5.918 13.096
sulfate_difference -0.002 -0.007 -0.999 -0.000001
Sulfate11: 5.9-13.1 & Diff < -1 (28 observations)
floatbank10_sulfate 5.294 4.919 0.001 11.597
floatbank11_sulfate 8.681 9.475 6.435 13.004
sulfate_difference -3.120 -4.556 -12.978 -1.002
Sulfate11 ≤ 1 & Diff > 1 (402 observations)
floatbank10_sulfate 12.999 13.014 1.240 23.748
floatbank11_sulfate 0.029 0.028 0.000086 0.241
sulfate_difference 12.964 12.986 1.050 23.747
Sulfate11 ≤ 1 & Diff 0-1 (10 observations)
floatbank10_sulfate 0.043 0.307 0.001 1.352
floatbank11_sulfate 0.013 0.221 0.000049 0.961
sulfate_difference 0.023 0.086 0.0002 0.391
Sulfate11 ≤ 1 & Diff 0-(-1) (11 observations)
floatbank10_sulfate 0.009 0.210 0.002 0.676
floatbank11_sulfate 0.159 0.321 0.004 0.830
sulfate_difference -0.034 -0.111 -0.371 -0.001
Sulfate11 ≤ 1 & Diff < -1 (0 observations)
floatbank10_sulfate N/A N/A N/A N/A
floatbank11_sulfate N/A N/A N/A N/A
sulfate_difference N/A N/A N/A N/A

Key Observations

  • Near-equilibrium dominates: 9,911 out of 14,853 observations (66.7%) fall within the -1 to +1 difference range when sulfate11 is in range (5.9-13.1)
  • Low sulfate11 creates large positive differences: 402 observations with sulfate11 ≤ 1 show large positive differences (median: 12.964)
  • Extreme negative differences are rare: Only 28 observations show differences < -1 in the normal sulfate11 range (5.9 - 13.1)
  • Overall relationship is balanced: Median difference of 0.000085 indicates balance
  • Most data concentrated in range: The 5.9-13.1 sulfate11 range contains the majority of reliable data with small differences between sensors

Conclusion

Can safely input values from rougher.input.floatbank11_sulfate when in the range (5.9 - 13.2) into rougher.input.floatbank10_sulfate. The remaining 1.5% (222) missing values, from fb10_sulfate could potentially be from sensor errors and can be dropped to maintain balance.

Imputation Strategy for rougher.input.floatbank10_sulfate: for Test Set¶

Analysis of the relationship between rougher.input.floatbank10_sulfate and rougher.input.floatbank11_sulfate reveals two distinct behavioral patterns based on sulfate 11 concentration, enabling accurate conditional imputation for the test set.

image.pngimage.png


Statistical Comparison

Pattern 1: Sulfate 11 Between 2.0 - 5.9 (Near-Perfect Correlation)

Metric floatbank10_sulfate floatbank11_sulfate Difference Key Finding
Median 4.87 4.84 0.00015 Nearly identical values
Mean 4.72 4.67 0.047 Minimal systematic bias
Min 0.26 2.39 -5.58 Occasional outliers
Max 8.51 5.89 5.41 Occasional outliers

Observations: 280/285 (98.2% for this range)

Pattern 2: Sulfate 11 Between 0.0 - 2.0 (Large Offset)

Metric floatbank10_sulfate floatbank11_sulfate Difference Key Finding
Median 12.99 0.030 12.96 Consistent +13 offset
Mean 12.54 0.040 12.50 Stable relationship
Min 6.00 0.000086 5.81 Lower bound maintained
Max 18.00 1.58 18.00 Upper bound maintained

Observations: 379/428 (88.6% for this range)


Key Findings

  • Conditional relationship identified: The relationship between floatbank 10 and 11 sulfate measurements changes dramatically based on sulfate 11 concentration
  • Pattern 1 (Mid-Range, 2.0-5.9): Median difference of 0.00015 confirms floatbank10 ≈ floatbank11 under normal synchronized conditions
  • Pattern 2 (Low Range, 0.0-2.0): Median difference of 12.96 reveals floatbank 10 maintains a consistent +13 unit offset, suggesting different stages
  • Strong empirical support: 659 total training observations (280 + 379) provide high confidence in pattern reliability
  • High coverage in low range: 88.6% of observations (379/428) in the 0-2 range follow the +12.96 offset pattern

Recommended imputation approach: Apply conditional imputation to the test set based on floatbank11_sulfate value:

  • When sulfate 11 [2.0 - 5.9]: floatbank10_sulfate = floatbank11_sulfate + 0.0
  • When sulfate 11 [0.0 - 2.0]: floatbank10_sulfate = floatbank11_sulfate + 12.96

The strong patterns observed in training data ensure this relationship-based imputation will generalize well to the test set and accurately represent underlying process dynamics.

Imputation Strategy for rougher.input.floatbank11_sulfate: for Test Set¶

Analysis of the relationship between rougher.input.floatbank11_sulfate and rougher.input.floatbank10_sulfate reveals a dominant pattern when sulfate 10 is in the low-to-mid range (0-11), enabling accurate imputation for the test set.

image.png


Statistical Comparison

Pattern 1: Sulfate 10 Between 0 - 11, Difference < 6 (Near-Perfect Correlation)

Metric floatbank11_sulfate floatbank10_sulfate Difference Key Finding
Median 9.43 9.43 -0.00014 Nearly identical values
Mean 8.99 8.97 -0.014 Minimal systematic bias
Min 0.000049 0.0012 -12.98 Occasional outliers
Max 14.50 11.00 5.99 Occasional outliers
Observations 6,076 6,076 - Excellent sample size

Observations: 6,076/6128 - 99.2% in this range

Pattern 2: Sulfate 10 Between 0 - 11, Difference Between 6 - 16 (Large Offset)

Metric floatbank11_sulfate floatbank10_sulfate Difference Key Finding
Median 0.025 10.00 9.97 Floatbank 11 near zero
Mean 0.046 9.46 9.41 Floatbank 11 near zero
Min 0.000086 6.81 6.78 Lower bound maintained
Max 1.22 11.00 10.97 Upper bound maintained

Observations: 52/6128


Key Findings

The relationship between floatbank 11 and 10 sulfate measurements is dominated by a single strong pattern: when sulfate 10 is between 0-11, the measurements are nearly identical (difference ≈ -0.00014). With 6,076 training observations representing 99.2% coverage (6,076/6,128) in this range, the near-perfect correlation provides extremely high confidence for imputation. The secondary pattern (52 observations with difference 6-16) represents less than 1% of cases and involves floatbank 11 values near zero, making it unsuitable for reliable imputation.


Recommended imputation approach: Apply simple imputation to the test set based on floatbank10_sulfate value when it falls between 0-11:

  • floatbank11_sulfate = floatbank10_sulfate - 0.0

This strategy leverages the overwhelmingly dominant pattern (99.2% coverage) where the two measurements are synchronized. The exceptional sample size and near-perfect correlation ensure this imputation will accurately represent the flotation process for virtually all missing values in the test set.

Imputation Strategy for rougher.input.feed_sol: for Test Set¶

Analysis of the relationship between rougher.input.feed_sol and rougher.input.feed_size reveals that the optimal imputation strategy varies across different feed_size ranges. Unlike the sulfate measurements which showed consistent conditional patterns, the feed_sol relationship requires a range-specific approach using either median values or calculated offsets depending on data distribution characteristics.


Statistical Summary by Feed Size Range

Feed Size Range Observations Feed Sol Median Difference Median Imputation Method Rationale
24 - 25.5 10 25.22 2.25 Median (25.22) Small sample; tighter sol range (16-31)
25.5 - 30 33 39.27 -11.22 Median (39.27) Sol cluster 37-41; difference spread -14 to -6
30 - 35 60 36.61 -4.45 Median (36.61) Moderate sample; sol range 30-43
35 - 38 97 30.25 6.91 Offset (+6.9) Difference range tighter than sol range
40 - 50 3,866 34.54 11.81 Offset (+11.8) Large sample; consistent difference pattern
50 - 60 4,927 36.80 17.62 Median (36.80) Large sample; sol values cluster around median
60 - 70 2,259 38.94 25.30 Median (38.94) Large sample; stable median
70 - 75 777 39.41 33.27 Median (39.41) Moderate sample; difference range 28-41
80 - 85 479 40.66 42.25 Median (40.66) Moderate sample; difference range 37-47

Key Findings

The feed_sol and feed_size relationship is more complex than the sulfate measurements, with imputation strategies chosen based on: (1) sample size reliability, (2) whether the sol median or difference showed tighter clustering, and (3) the presence of consistent patterns. Two primary ranges dominate the training data: feed_size 40-50 (3,866 observations) and 50-60 (4,927 observations), representing the most reliable imputation zones. For smaller ranges (24-38), median sol values were preferred due to limited sample sizes and wider difference spreads. For the 35-38 and 40-50 ranges specifically, offset-based imputation was used because the difference showed more consistent patterns than the absolute sol values.


Limitations and Considerations

Unlike the sulfate imputation strategies which had strong conditional relationships (99%+ coverage with near-perfect correlation), the feed_sol imputation is less robust due to:

  • Weaker correlations: Higher variance within ranges, particularly in smaller feed_size ranges
  • Mixed methodology: Combining median and offset strategies introduces inconsistency
  • Limited observations in extremes: Only 10-97 observations in the 24-38 range reduces confidence
  • Wide difference spreads: Some ranges show difference variations of 20+ units, indicating higher uncertainty

This approach represents a pragmatic solution given the data characteristics, prioritizing the use of stable median values from large samples (4,927 and 3,866 observations in the 50-60 and 40-50 ranges) while accepting lower precision in edge cases. The imputation will be adequate for modeling purposes but carries more uncertainty than the sulfate strategies.

Imputation Strategy for rougher.input.feed_sol: for Test Set¶

Analysis of the relationship between rougher.input.feed_sol and rougher.input.feed_size reveals that the optimal imputation strategy varies across different feed_size ranges. Unlike the sulfate measurements which showed consistent conditional patterns, the feed_sol relationship requires a range-specific approach using either median values or calculated offsets depending on data distribution characteristics.

image.pngimage.png

image.pngimage.png

image.pngimage.png

image.pngimage.png


Statistical Summary by Feed Size Range

Feed Size Range Observations Feed Sol Median Difference Median Imputation Method Rationale
24 - 30 (diff -5 to 12) 10 25.22 2.25 Median (25.22) for feed_size < 25.5 Small sample; lower sol values
24 - 30 (diff < -5) 33 39.27 -11.22 Median (39.27) for feed_size ≥ 25.5 Better sample; sol cluster 34-40
30 - 35 60 36.61 -4.45 Median (36.61) Moderate sample; sol range 30-43
35 - 38 97 30.25 6.91 Offset (+6.9) Difference tighter than sol range
40 - 50 3,866 34.54 11.81 Offset (+11.8) Large sample; consistent difference
50 - 60 4,927 36.80 17.62 Median (36.80) Largest sample; stable median
60 - 70 2,259 38.94 25.30 Median (38.94) Large sample; stable median
70 - 75 777 39.41 33.27 Median (39.41) Moderate sample
80 - 85 479 40.66 42.25 Median (40.66) Moderate sample

Key Findings

The feed_sol and feed_size relationship is more complex than the sulfate measurements, with imputation strategies chosen based on: (1) sample size reliability, (2) whether the sol median or difference showed tighter clustering, and (3) the presence of consistent patterns. Two primary ranges dominate the training data: feed_size 40-50 (3,866 observations) and 50-60 (4,927 observations), representing the most reliable imputation zones. For smaller ranges (24-38), median sol values were preferred due to limited sample sizes and wider difference spreads. For the 35-38 and 40-50 ranges specifically, offset-based imputation was used because the difference showed more consistent patterns than the absolute sol values.


Limitations and Considerations

Unlike the sulfate imputation strategies which had strong conditional relationships (99%+ coverage with near-perfect correlation), the feed_sol imputation is less robust due to:

  • Weaker correlations: Higher variance within ranges, particularly in smaller feed_size ranges
  • Mixed methodology: Combining median and offset strategies introduces inconsistency
  • Limited observations in extremes: Only 10-97 observations in the 24-38 range reduces confidence
  • Wide difference spreads: Some ranges show difference variations of 20+ units, indicating higher uncertainty

This approach represents a pragmatic solution given the data characteristics, prioritizing the use of stable median values from large samples (4,927 and 3,866 observations in the 50-60 and 40-50 ranges) while accepting lower precision in edge cases. The imputation will be adequate for modeling purposes but carries more uncertainty than the sulfate strategies.

Imputation Strategy for Remaining Columns (<1% Missing Data: Test Set)¶

For all remaining columns in the test set with less than 1% missing data, a simple median imputation strategy was applied using column-specific medians calculated from the training set.


Methodology

Imputation approach:

  • Calculate the median value for each column from the training set (after row-dropping preprocessing)
  • Apply these training set medians to fill missing values in the corresponding test set columns
  • No conditional logic or range-based strategies required

Rationale:

  • Minimal impact: With <1% missing data per column, the imputation method has negligible effect on model performance
  • Computational efficiency: Simple median imputation is fast and straightforward
  • Adequate accuracy: For such small percentages of missing data, sophisticated methods provide minimal benefit over median imputation
  • Consistency with training: Using training set medians (rather than test set medians) prevents data leakage and maintains proper train-test separation

Key Considerations

For columns with <1% missing data, the additional complexity of conditional imputation is unnecessary. The simple median approach provides a clean, efficient solution that maintains data integrity while having minimal impact on the final model predictions.

Original Dataset: Outlier Analysis and Distribution Assessment¶

Analysis of the original dataset (pre-imputation) reveals significant outliers and wide variance across metal concentration measurements, particularly in gold (Au) processing stages. This assessment examines the raw data distribution before any imputation or outlier handling.


Distribution Summary by Metal Type

Gold (Au) Concentrations:

image.png

Gold (Au) - Extreme outliers and process anomalies:

  • Primary Cleaner Concentrate Au: Maximum 45.93 with high variance (±10.54), showing 41% above 75th percentile (34.77)
  • Secondary Cleaner Tail Au: Maximum 26.81, far exceeding expected near-depletion levels (median 3.96)
  • Final Concentrate Au: Range 0-53.61 represents massive spread; zero values indicate complete process failures
  • Rougher Tail Au: Maximum 9.69 is 5x the median (1.81), suggesting occasional poor separation
Stage Median Mean Std Dev Min Max Key Observation
Rougher Feed 7.88 7.57 3.03 0.00 14.09 Baseline input
Rougher Concentrate 20.00 17.88 6.79 0.00 28.82 2.5x concentration from feed
Rougher Tail 1.81 1.82 0.70 0.02 9.69 Low concentration, good separation
Primary Cleaner Concentrate 32.36 29.21 10.54 0.00 45.93 Highest variance (±10.54)
Primary Cleaner Tail 3.51 3.67 1.99 0.00 18.53 Moderate loss in tail
Secondary Cleaner Tail 3.96 4.04 2.61 0.00 26.81 Higher variance than primary tail
Final Concentrate 44.65 40.00 13.40 0.00 53.61 Maximum enrichment achieved (5.7x feed)
Final Tail 2.91 2.83 1.26 0.00 9.79 Minimal loss, efficient recovery

image.png

Silver (Ag) Concentrations:

image.png

Silver (Ag) - Inverted concentration patterns indicate data quality issues:

  • Primary Cleaner stages show inverted relationship: Tail (15.60) > Concentrate (8.27), opposite of expected behavior
  • Final Concentrate Ag (4.95) is LOWER than feed (8.30): Indicates silver rejection, not concentration
  • Final Tail Ag (9.48) exceeds final concentrate: Confirms poor silver recovery throughout process
  • Primary Cleaner Tail maximum (29.46): Extreme outlier nearly 2x the median (15.60)
Stage Median Mean Std Dev Min Max Key Observation
Rougher Feed 8.30 8.07 3.13 0.00 14.87 Baseline input
Rougher Concentrate 11.79 10.87 4.38 0.00 24.48 1.4x concentration from feed
Rougher Tail 5.76 5.59 1.11 0.59 12.72 Higher than Au tail (less efficient separation)
Primary Cleaner Concentrate 8.27 7.69 3.11 0.00 16.08 Lower than rougher concentrate (unusual)
Primary Cleaner Tail 15.60 14.88 6.54 0.00 29.46 Higher than concentrate (inverted pattern)
Secondary Cleaner Tail 15.22 13.38 5.77 0.00 23.26 Similar to primary cleaner tail
Final Concentrate 4.95 4.78 2.03 0.00 16.00 Low enrichment (~0.6x feed)
Final Tail 9.48 8.92 3.52 0.00 19.55 Higher than concentrate (poor recovery)

image.png

Lead (Pb) Concentrations:

image.png

Lead (Pb) - Moderate outliers with proper concentration trend:

  • Rougher Concentrate Pb: Maximum 18.39 is 2.4x the median (7.57), highest relative outlier
  • Secondary Cleaner Tail Pb: Maximum 17.04 is 3.4x the median (5.07), indicating occasional heavy losses
  • Overall pattern is correct: Concentrate > Tail at each stage, unlike silver
Stage Median Mean Std Dev Min Max Key Observation
Rougher Feed 3.43 3.31 1.45 0.00 7.14 Baseline input (lowest of 3 metals)
Rougher Concentrate 7.57 6.90 2.81 0.00 18.39 2.2x concentration from feed
Rougher Tail 0.59 0.59 0.32 0.00 3.78 Excellent separation (82% reduction)
Primary Cleaner Concentrate 9.92 8.92 3.71 0.00 17.08 Further enrichment to 3x feed
Primary Cleaner Tail 3.15 3.18 1.65 0.00 9.63 Moderate loss in tail
Secondary Cleaner Tail 5.07 5.30 3.09 0.00 17.04 Higher than primary tail
Final Concentrate 9.91 9.10 3.23 0.00 17.03 Maximum enrichment (2.9x feed)
Final Tail 2.65 2.49 1.19 0.00 6.09 Low loss, good overall recovery

image.png


Critical Outlier Observations

Zero values across all metals:

  • Present at minimum for nearly all stages (Au, Ag, Pb)
  • Particularly concerning in concentrate stages where zeros indicate complete process failure
  • May represent measurement errors, sensor failures, or true process shutdowns

Variance patterns:

  • Gold: Highest absolute variance (std dev 10.54-13.40) in cleaner concentrate stages
  • Silver: Primary cleaner tail shows highest relative variance (std dev 6.54, 44% of median 14.88)
  • Lead: Most stable relative to median, except secondary cleaner tail (std dev 3.09, 61% of median 5.07)

Implications for Modeling

Before outlier handling:

  1. Gold (Au) - High variance with valid extreme values:

    • Range 0-53.61 in final concentrate creates modeling challenges but may represent real high-grade ore batches
    • Zero values in concentrate stages likely indicate process failures/measurement errors - should be investigated
    • Primary and secondary cleaner stages show highest variance (10.54-13.40 std dev), requiring robust scaling or transformation
    • Tail outliers (rougher tail max 9.69 vs median 1.81) suggest occasional poor separation events
  2. Silver (Ag) - Inverted patterns indicate process/data issues:

    • Critical problem: Final concentrate (4.95) < Feed (8.30) < Final tail (9.48) shows silver rejection, not recovery
    • Primary cleaner tail > concentrate pattern is physically implausible for a concentration process
    • Either: (a) silver is intentionally rejected to purify gold, (b) measurement/labeling errors exist, or (c) process is malfunctioning
    • High variance in tail stages (std dev 5.77-6.54) with extreme outliers up to 29.46
  3. Lead (Pb) - Proper concentration with moderate outliers:

    • Correct concentration pattern maintained (concentrate > tail at each stage)
    • 2.9x final enrichment from feed shows successful recovery
    • Rougher concentrate maximum (18.39) is 2.4x median, indicating occasional high-grade batches
    • Secondary cleaner tail outliers (max 17.04 vs median 5.07) suggest heavy lead losses in some batches
  4. Cross-metal observations:

    • Scale differences: Gold (0-54), silver (0-29), and lead (0-17) require normalization/scaling
    • Zero minimums: Present across all metals and stages - investigate if these are nulls, true zeros, or measurement failures
    • Boxplot patterns: Gold shows most outliers above upper whisker; silver shows outliers in both tails and concentrates; lead shows moderate outlier density

Recommendations:

  1. Investigate zero values: Determine if these represent missing data, process shutdowns, or true depletion
  2. Verify silver measurements: Inverted concentration patterns suggest labeling errors or intentional rejection
  3. Consider log transformation for gold: Compress the 0-54 range to handle extreme enrichment values
  4. Examine high outliers in context: Cross-reference extreme concentrate values with ore grade data to determine if real vs. anomalous
  5. Flag tail stage outliers: High values in tails (especially Au secondary cleaner tail max 26.81) indicate incomplete separation
  6. Robust scaling recommended: High variance across all metals suggests StandardScaler may amplify outlier influence

Note: This analysis represents the original dataset with NaN values present. Distribution characteristics may change significantly after imputation, particularly for columns with >1% missing data that underwent sophisticated conditional imputation strategies.

Original Dataset Vs. Imputed Dataset: Outlier Analysis and Distribution Assessment¶

Metal Original Median Imputed Median Original Mean Imputed Mean Original Std Imputed Std Original Min Imputed Min Original Max Imputed Max Notes
Au (Gold) 7.88 7.77 7.57 8.01 3.03 1.94 0.00 0.01 14.09 13.90 Less spread and no zeros after imputation — good
Ag (Silver) 8.30 8.28 8.07 8.71 3.13 1.98 0.00 0.01 14.87 14.6 Same pattern — realistic floor now
Pb (Lead) 3.43 3.47 3.31 3.57 1.45 1.11 0.00 0.01 7.14 7.14 Minor tightening — variance reduced

image.pngimage.png

Conclusion

The rougher feed concentrations of Au, Ag, and Pb were compared between the original and imputed datasets. The original data contained several zeros, likely due to missing sensor readings. After imputation, the minimum values increased to 0.01, standard deviations decreased slightly, and overall means remained consistent — indicating improved data integrity without distorting the underlying distributions. Outliers were retained, as they likely reflect genuine variations in ore composition rather than measurement errors.

Feed_Size Distribution: Train Vs. Test¶

Dataset Train Mean Test Mean Train Median Test Median Std Diff Notes
Original 58.68 55.94 54.10 50.00 +1.2 More “natural,” small realistic difference
Imputed 59.92 55.93 55.25 50.11 +0.3 Slightly higher mean in train, but similar pattern

image.pngimage.png

Conclusion

Both the original and imputed datasets were evaluated for feed size distribution consistency between the training and test sets. The original dataset shows slightly more natural variation, while imputation slightly increases the mean feed size in the training set due to smoothing of missing values. However, both datasets maintain comparable distribution shapes and ranges, confirming that train–test distributions are sufficiently aligned for model evaluation.

Total Concentration Analysis¶

Original Dataset

image.png

Stage Count Mean Std Dev Min 25% 50% (Median) 75% Max
Feed 22,471 18.99 7.30 0.00 16.55 19.63 23.62 35.07
Rougher Concentrate 22,618 35.65 13.22 0.00 37.38 39.98 42.19 55.57
Final Concentrate 22,627 53.88 17.70 0.00 58.71 60.08 60.99 65.58

Observations:

  • The original dataset contains zero values in all stages, which may indicate missing or faulty sensor readings.
  • The total concentrations increase logically from feed → rougher → final, but extreme lows (zeros) are anomalous.

Imputed Dataset

image.png

Stage Count Mean Std Dev Min 25% 50% (Median) 75% Max
Feed 14,336 20.29 4.59 0.03 17.00 19.44 23.09 34.83

Observations:

  • After imputation, the feed stage has no zeros; the minimum increased to 0.03.
  • Standard deviation decreased slightly, indicating a tighter distribution.
  • The total feed concentration is more realistic and suitable for modeling without distorting the overall distribution.

Conclusion

Total concentrations at different processing stages were analyzed. The original dataset contained zero and extreme values, particularly in concentrate stages. For modeling, only rougher feed concentrations will be used. In the imputed dataset, zeros in rougher feed were replaced, reducing variance slightly, while preserving meaningful outliers. Other stages were not corrected, but anomalies were noted for process understanding.

sMAPE Analysis¶

Model Final sMAPE (%) Interpretation
Linear Regression 14.64 Achieved strong predictive accuracy, indicating relatively low average percentage error between predicted and actual recovery values.
Decision Tree 22.48 Performed noticeably worse than other models, suggesting higher variance or overfitting to the training data.

| Random Forest | 14.30 | Produced the lowest sMAPE, very close to Linear Regression, indicating robust and reliable performance.


Summary

Among the three models tested, Random Forest achieved the lowest sMAPE (14.30%), narrowly outperforming Linear Regression (14.64%), while the Decision Tree showed significantly higher error (22.48%).

This suggests that both Random Forest and Linear Regression are effective models for predicting gold recovery, with Random Forest offering slightly better accuracy and likely better generalization to unseen data. The Decision Tree model, while simpler, likely overfits and does not generalize as well to the test data.

Cross-Validation Analysis¶

Cross-Validation Results (5-Fold MAE)
Model Average Cross-Validation MAE
Random Forest 6.26
Linear Regression 6.86
Decision Tree N/A (not tested / worse)

Summary

The predictive performance of different models was evaluated using both sMAPE and 5-fold cross-validation with mean absolute error.

Random Forest: Produced the lowest sMAPE (14.30%) and the lowest average cross-validation MAE (6.26), indicating strong predictive accuracy and robustness.

Linear Regression: Achieved similar performance, with slightly higher sMAPE (14.64%) and MAE (6.86), suggesting it is also a reliable model, though marginally less accurate than Random Forest.

Decision Tree: Performed noticeably worse, with higher sMAPE (22.48%), indicating higher variance and overfitting risk.

Overall, Random Forest is the top-performing model, though Linear Regression remains a strong alternative. Decision Tree is not recommended due to its comparatively poorer accuracy.

Final Model Evaluation: Random Forest on True Test Set¶

After initial model comparison, the Random Forest model was selected for final evaluation on the complete test set with true target values. Hyperparameter tuning was applied to optimize performance.


Model Performance Progression

Stage Model Configuration Final sMAPE (%) Rougher MAE Final MAE Key Improvement
Initial Baseline Default Random Forest 14.30% 6.26 6.86 Cross-validation results
True Test Set Default Random Forest 12.28% - - -2.02pp improvement on real data
Hypertuned Model max_depth=5, n_estimators=100 11.48% 6.78 5.51 -0.80pp final improvement

Hyperparameter Tuning Results

Optimal parameters identified through GridSearchCV (5-fold):

Target Variable Best max_depth Best n_estimators Cross-Validation MAE
Final Output Recovery 5 100 5.51
Rougher Output Recovery 5 100 6.78

Both recovery predictions benefited from the same hyperparameter configuration, suggesting consistent optimal model complexity across targets.


Key Findings

Performance improvement through optimization:

  • Initial model (14.30% sMAPE) → True test set (12.28%) → Hypertuned (11.48%)
  • Total improvement: 2.82 percentage points (19.7% relative reduction in error)
  • Hypertuning alone contributed 0.80pp improvement beyond model selection

Model characteristics:

  • Optimal depth (5): Prevents overfitting while capturing complex flotation relationships
  • Optimal estimators (100): Balances ensemble strength with computational efficiency
  • Consistent parameters: Same configuration optimal for both rougher and final recovery predictions

Prediction patterns:

  • Final recovery predictions are more stable (±3% range in sample)
  • Rougher recovery predictions show higher variance (±20% range in sample)
  • This reflects real process behavior: rougher stage is more sensitive to ore variability

Conclusion

The hypertuned Random Forest model achieves a final sMAPE of 11.48% on the true test set, representing strong predictive accuracy for gold recovery optimization. The model successfully balances complexity (max_depth=5) with ensemble power (n_estimators=100), avoiding overfitting while maintaining robust performance on unseen data.

With an average prediction error of approximately 11.5%, the model provides reliable forecasts for both rougher and final stage gold recovery, enabling process optimization decisions with quantifiable confidence intervals.

In [ ]: